Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments
Abstract
We explore deep reinforcement learning methods for multi-agent domains. We begin by analyzing the difficulty of traditional algorithms in the multi-agent case: Q-learning is challenged by an inherent non-stationarity of the environment, while policy gradient suffers from a variance that increases as the number of agents grows. We then present an adaptation of actor-critic methods that considers action policies of other agents and is able to successfully learn policies that require complex multi-agent coordination. Additionally, we introduce a training regimen utilizing an ensemble of policies for each agent that leads to more robust multi-agent policies. We show the strength of our approach compared to existing methods in cooperative as well as competitive scenarios, where agent populations are able to discover various physical and informational coordination strategies.
1 Introduction
Reinforcement learning (RL) has recently been applied to solve challenging problems, from game playing [24, 29] to robotics [18]. In industrial applications, RL is emerging as a practical component in large scale systems such as data center cooling [1]. Most of the successes of RL have been in single agent domains, where modelling or predicting the behaviour of other actors in the environment is largely unnecessary. However, there are a number of important applications that involve interaction between multiple agents, where emergent behavior and complexity arise from agents co-evolving together. For example, multi-robot control [21], the discovery of communication and language [31, 8, 25], multiplayer games [28], and the analysis of social dilemmas [17] all operate in a multi-agent domain. Related problems, such as variants of hierarchical reinforcement learning [6] can also be seen as a multi-agent system, with multiple levels of hierarchy being equivalent to multiple agents. Additionally, multi-agent self-play has recently been shown to be a useful training paradigm [29, 32]. Successfully scaling RL to environments with multiple agents is crucial to building artificially intelligent systems that can productively interact with humans and each other. Unfortunately, traditional reinforcement learning approaches such as Q-Learning or policy gradient are poorly suited to multi-agent environments. One issue is that each agent’s policy is changing as training progresses, and the environment becomes non-stationary from the perspective of any individual agent (in a way that is not explainable by changes in the agent’s own policy). This presents learning stability challenges and prevents the straightforward use of past experience replay, which is crucial for stabilizing deep Q-learning. Policy gradient methods, on the other hand, usually exhibit very high variance when coordination of multiple agents is required. Alternatively, one can use model-based policy optimization which can learn optimal policies via back-propagation, but this requires a (differentiable) model of the world dynamics and assumptions about the interactions between agents. Applying these methods to competitive environments is also challenging from an optimization perspective, as evidenced by the notorious instability of adversarial training methods [11].
In this work, we propose a general-purpose multi-agent learning algorithm that: (1) leads to learned policies that only use local information (i.e. their own observations) at execution time, (2) does not assume a differentiable model of the environment dynamics or any particular structure on the communication method between agents, and (3) is applicable not only to cooperative interaction but to competitive or mixed interaction involving both physical and communicative behavior. The ability to act in mixed cooperative-competitive environments may be critical for intelligent agents; while competitive training provides a natural curriculum for learning [32], agents must also exhibit cooperative behavior (e.g. with humans) at execution time.

We adopt the framework of centralized training with decentralized execution, allowing the policies to use extra information to ease training, so long as this information is not used at test time. It is unnatural to do this with Q-learning without making additional assumptions about the structure of the environment, as the Q function generally cannot contain different information at training and test time. Thus, we propose a simple extension of actor-critic policy gradient methods where the critic is augmented with extra information about the policies of other agents, while the actor only has access to local information. After training is completed, only the local actors are used at the execution phase, acting in a decentralized manner and equally applicable in cooperative and competitive settings.
Since the centralized critic function explicitly uses the decision-making policies of other agents, we additionally show that agents can learn approximate models of other agents online and effectively use them in their own policy learning procedure. We also introduce a method to improve the stability of multi-agent policies by training agents with an ensemble of policies, thus requiring robust interaction with a variety of collaborator and competitor policies. We empirically show the success of our approach compared to existing methods in cooperative as well as competitive scenarios, where agent populations are able to discover complex physical and communicative coordination strategies.
2 Related Work
The simplest approach to learning in multi-agent settings is to use independently learning agents. This was attempted with Q-learning in [36], but does not perform well in practice [23]. As we will show, independently-learning policy gradient methods also perform poorly. One issue is that each agent’s policy changes during training, resulting in a non-stationary environment and preventing the naïve application of experience replay. Previous work has attempted to address this by inputting other agents’ policy parameters to the Q function [37], explicitly adding the iteration index to the replay buffer, or using importance sampling [9]. Deep Q-learning approaches have previously been investigated in [35] to train competing Pong agents. The nature of interaction between agents can be cooperative, competitive, or both, and many algorithms are designed only for a particular nature of interaction. Most studied are cooperative settings, with strategies such as optimistic and hysteretic Q function updates [15, 22, 26], which assume that the actions of other agents are made to improve collective reward. Another approach is to indirectly arrive at cooperation via sharing of policy parameters [12], but this requires homogeneous agent capabilities. These algorithms are generally not applicable in competitive or mixed settings. See [27, 4] for surveys of multi-agent learning approaches and applications.
Concurrently to our work, [7] proposed a similar idea of using policy gradient methods with a centralized critic, and test their approach on a StarCraft micromanagement task. Their approach differs from ours in the following ways: (1) they learn a single centralized critic for all agents, whereas we learn a centralized critic for each agent, allowing for agents with differing reward functions including competitive scenarios, (2) we consider environments with explicit communication between agents, (3) they combine recurrent policies with feed-forward critics, whereas our experiments use feed-forward policies (although our methods are applicable to recurrent policies), (4) we learn continuous policies whereas they learn discrete policies.
Recent work has focused on learning grounded cooperative communication protocols between agents to solve various tasks [31, 8, 25]. However, these methods are usually only applicable when the communication between agents is carried out over a dedicated, differentiable communication channel. Our method requires explicitly modeling the decision-making process of other agents. The importance of such modeling has been recognized by both the reinforcement learning [3, 5] and cognitive science [10] communities. [13] stressed the importance of being robust to the decision-making process of other agents, as do others who build Bayesian models of decision making. We incorporate such robustness considerations by requiring that agents interact successfully with an ensemble of any possible policies of other agents, improving training stability and the robustness of agents after training.
3 Background
Markov Games. In this work, we consider a multi-agent extension of Markov decision processes (MDPs) called partially observable Markov games [20]. A Markov game for N agents is defined by a set of states S describing the possible configurations of all agents, together with a set of actions A_1, ..., A_N and a set of observations O_1, ..., O_N for each agent. Each agent i selects actions with a stochastic policy conditioned on its own observation, the next state is produced by a state transition function conditioned on the current state and the joint action, and each agent receives a reward r_i (a function of the state and its action) and a private observation o_i correlated with the state. Each agent aims to maximize its own total expected (discounted) return.
Q-Learning and Deep Q-Networks (DQN). Q-Learning and DQN [24] are popular methods in reinforcement learning and have been previously applied to multi-agent settings [8, 37]. Q-Learning makes use of an action-value function Q^π(s, a), the expected return obtained by taking action a in state s and following π thereafter. DQN learns the action-value function corresponding to the optimal policy by minimizing the loss L(θ) = E_{s,a,r,s′}[(Q*(s, a | θ) − y)^2] with y = r + γ max_{a′} Q̄*(s′, a′),
where Q̄ is a target Q function whose parameters are periodically updated with the most recent θ, which helps stabilize learning. Another crucial component of stabilizing DQN is the use of an experience replay buffer D containing tuples (s, a, r, s′).
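As a concrete illustration of the target network and replay buffer described above, a minimal DQN loss computation might look as follows. This is an illustrative sketch, not code from our experiments; it assumes a PyTorch Q-network `q_net`, a periodically synchronized copy `q_target`, and minibatches of (s, a, r, s′) tuples sampled from D.

```python
import torch
import torch.nn.functional as F

def dqn_loss(q_net, q_target, batch, gamma=0.95):
    # s: [B, obs_dim] float tensor, a: [B] long tensor of action indices,
    # r: [B] float tensor, s_next: [B, obs_dim] float tensor
    s, a, r, s_next = batch
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)        # Q(s, a | theta)
    with torch.no_grad():                                        # target uses the frozen network Q-bar
        y = r + gamma * q_target(s_next).max(dim=1).values
    return F.mse_loss(q_sa, y)
```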
Q-Learning can be directly applied to multi-agent settings by having each agent i learn an independently optimal function Q_i [36]. However, because agents are independently updating their policies as learning progresses, the environment appears non-stationary from the view of any one agent, violating the Markov assumptions required for convergence of Q-learning. Another difficulty observed in [9] is that the experience replay buffer cannot be used in such a setting, since in general P(s′ | s, a, π_1, ..., π_N) ≠ P(s′ | s, a, π′_1, ..., π′_N) when any π_i ≠ π′_i: transitions collected under the old policies of other agents no longer describe the dynamics the agent currently faces.
Policy Gradient (PG) Algorithms. Policy gradient methods directly adjust the parameters θ of the policy in order to maximize the objective J(θ) = E_{s∼p^π, a∼π_θ}[R] by taking steps in the direction of an estimate of ∇_θ J(θ) [39, 34]. Policy gradient methods are known to exhibit high variance gradient estimates. This is exacerbated in multi-agent settings; since an agent’s reward usually depends on the actions of many agents, the reward conditioned only on the agent’s own actions (when the actions of other agents are not considered in the agent’s optimization process) exhibits much more variability, thereby increasing the variance of its gradients. Below, we show a simple setting where the probability of taking a gradient step in the correct direction decreases exponentially with the number of agents.
The use of baselines, such as value function baselines typically used to ameliorate high variance, is problematic in multi-agent settings due to the non-stationarity issues mentioned previously.
Deterministic Policy Gradient (DPG) Algorithms. It is also possible to extend the policy gradient framework to deterministic policies µ_θ : S → A [30]. In particular, under certain conditions the deterministic policy gradient theorem allows us to write the gradient of the objective J(θ) = E_{s∼ρ^µ}[R(s, a)] as ∇_θ J(θ) = E_{s∼D}[∇_θ µ_θ(s) ∇_a Q^µ(s, a)|_{a=µ_θ(s)}].
Since this theorem relies on ∇_a Q^µ(s, a), it requires that the action space A (and thus the policy µ) be continuous.
Deep deterministic policy gradient (DDPG) [19] is a variant of DPG where the policy µ and critic Qµ are approximated with deep neural networks. DDPG is an off-policy algorithm, and samples trajectories from a replay buffer of experiences that are stored throughout training. DDPG also makes use of a target network, as in DQN [24].
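For reference, a compact sketch of one DDPG update consistent with the description above is given below. It is illustrative rather than the original implementation; `actor` (µ_θ), `critic` (Q^µ), their target copies, and the optimizers are assumed to be PyTorch modules and optimizers.

```python
import torch
import torch.nn.functional as F

def ddpg_update(actor, critic, actor_targ, critic_targ,
                actor_opt, critic_opt, batch, gamma=0.95, tau=0.01):
    s, a, r, s_next = batch
    # Critic: regress Q(s, a) toward the bootstrapped target computed with the target networks.
    with torch.no_grad():
        y = r + gamma * critic_targ(s_next, actor_targ(s_next)).squeeze(-1)
    critic_loss = F.mse_loss(critic(s, a).squeeze(-1), y)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor: deterministic policy gradient, i.e. ascend Q(s, mu(s)) with respect to theta.
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    # Soft-update the target networks (Polyak averaging with rate tau).
    params = list(actor.parameters()) + list(critic.parameters())
    targ_params = list(actor_targ.parameters()) + list(critic_targ.parameters())
    for p, p_targ in zip(params, targ_params):
        p_targ.data.mul_(1 - tau).add_(tau * p.data)
```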
4 Methods
4.1 Multi-Agent Actor Critic
We have argued in the previous section that naïve policy gradient methods perform poorly in simple multi-agent settings, and this is supported in our experiments in Section 5. Our goal in this section is to derive an algorithm that works well in such settings. However, we would like to operate under the following constraints: (1) the learned policies can only use local information (i.e. their own observations) at execution time, (2) we do not assume a differentiable model of the environment dynamics, unlike in [25], and (3) we do not assume any particular structure on the communication method between agents (that is, we don’t assume a differentiable communication channel). Fulfilling the above desiderata would provide a general-purpose multi-agent learning algorithm that could be applied not just to cooperative games with explicit communication channels, but competitive games and games involving only physical interactions between agents.
Similarly to [8], we accomplish our goal by adopting the framework of centralized training with decentralized execution. Thus, we allow the policies to use extra information to ease training, so long as this information is not used at test time. It is unnatural to do this with Q-learning, as the Q function generally cannot contain different information at training and test time. Thus, we propose a simple extension of actor-critic policy gradient methods where the critic is augmented with extra information about the policies of other agents.
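A minimal sketch of this centralized-critic, decentralized-actor update (for one agent i and one minibatch) is shown below. It is an illustrative reading of the method rather than the implementation used in our experiments, and it assumes each agent has an actor mapping its local observation to an action and a critic taking the concatenation of all agents’ observations and actions.

```python
import torch
import torch.nn.functional as F

def maddpg_update(i, actors, critics, actor_targs, critic_targs,
                  actor_opts, critic_opts, batch, gamma=0.95):
    # obs, acts, obs_next: lists of per-agent tensors ([B, dim_j]); rews[i]: [B]
    obs, acts, rews, obs_next = batch
    x, x_next = torch.cat(obs, -1), torch.cat(obs_next, -1)

    # Centralized critic for agent i sees all observations and all actions.
    with torch.no_grad():
        a_next = [actor_targs[j](obs_next[j]) for j in range(len(actors))]
        y = rews[i] + gamma * critic_targs[i](x_next, torch.cat(a_next, -1)).squeeze(-1)
    q = critics[i](x, torch.cat(acts, -1)).squeeze(-1)
    critic_loss = F.mse_loss(q, y)
    critic_opts[i].zero_grad(); critic_loss.backward(); critic_opts[i].step()

    # Decentralized actor: only agent i's action is re-computed from its own policy;
    # the other agents' actions are taken from the replay buffer.
    a_pol = [actors[j](obs[j]) if j == i else acts[j].detach() for j in range(len(actors))]
    actor_loss = -critics[i](x, torch.cat(a_pol, -1)).mean()
    actor_opts[i].zero_grad(); actor_loss.backward(); actor_opts[i].step()
```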
Figure 1: Overview of our multi-agent decentralized actor, centralized critic approach.
Note that we require the policies of other agents to apply an update in Eq. 6. Knowing the observations and policies of other agents is not a particularly restrictive assumption; if our goal is to train agents to exhibit complex communicative behaviour in simulation, this information is often available to all agents. However, we can relax this assumption if necessary by learning the policies of other agents from observations — we describe a method of doing this in Section 4.2.
4.2 Inferring Policies of Other Agents
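As referenced in Section 4.1 and evaluated in Section 5.3, each agent can relax the requirement of knowing other agents’ policies by maintaining an approximation of each other agent’s policy, fit online from that agent’s observed behavior and used in place of the true policy when computing the centralized critic’s target. A minimal sketch of one such fitting step is given below, under the assumption that the approximation maximizes the log probability of the observed actions plus an entropy regularizer weighted by λ (the λ referenced in Section 5.3); the module and variable names are illustrative, not taken from our code.

```python
import torch

def fit_approx_policy(approx_policy_j, opt_j, obs_j, act_j, lam=0.001):
    # approx_policy_j is assumed to map observations to a torch.distributions object
    # over agent j's actions (e.g. Categorical for discrete actions).
    dist = approx_policy_j(obs_j)
    loss = -dist.log_prob(act_j).mean() - lam * dist.entropy().mean()
    opt_j.zero_grad(); loss.backward(); opt_j.step()
    return loss.item()
```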
4.3 Agents with Policy Ensembles
As previously mentioned, a recurring problem in multi-agent reinforcement learning is the non-stationarity of the environment due to the agents’ changing policies. This is particularly true in competitive settings, where agents can derive a strong policy by overfitting to the behavior of their competitors. Such policies are undesirable as they are brittle and may fail when the competitors alter their strategies. To obtain multi-agent policies that are more robust to changes in the policy of competing agents, we propose to train a collection of K different sub-policies. At each episode, we randomly select one particular sub-policy for each agent to execute. Suppose that policy µ_i is an ensemble of K different sub-policies; we then maximize the expected return of the ensemble, with a sub-policy drawn uniformly at random for each episode, and maintain a separate replay buffer for each sub-policy.
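A sketch of the per-episode sub-policy selection is shown below; the helper and variable names are hypothetical and only illustrate the sampling scheme described above.

```python
import random

def select_subpolicies(agent_ensembles):
    """agent_ensembles: list of lists; agent_ensembles[i] holds agent i's K sub-policies.
    Returns the index of the sub-policy each agent will execute for this episode."""
    return [random.randrange(len(ensemble)) for ensemble in agent_ensembles]

# Usage per episode: k = select_subpolicies(ensembles); agent i acts with ensembles[i][k[i]],
# and that episode's transitions are stored in the replay buffer associated with sub-policy k[i].
```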
5 Experiments
5.1 Environments
To perform our experiments, we adopt the grounded communication environment proposed in [25], which consists of N agents and L landmarks inhabiting a two-dimensional world with continuous space and discrete time. Agents may take physical actions in the environment and communication actions that get broadcasted to other agents. Unlike [25], we do not assume that all agents have identical action and observation spaces, or act according to the same policy π. We also consider games that are both cooperative (all agents must maximize a shared return) and competitive (agents have conflicting goals). Some environments require explicit communication between agents in order to achieve the best reward, while in other environments agents can only perform physical actions. We provide details for each environment below.
Figure 2: Illustrations of the experimental environment and some tasks we consider, including a) Cooperative Communication b) Predator-Prey c) Cooperative Navigation d) Physical Deception. See webpage for videos of all experimental results.
Cooperative communication. This task consists of two cooperative agents, a speaker and a listener, who are placed in an environment with three landmarks of differing colors. At each episode, the listener must navigate to a landmark of a particular color, and obtains reward based on its distance to the correct landmark. However, while the listener can observe the relative position and color of the landmarks, it does not know which landmark it must navigate to. Conversely, the speaker’s observation consists of the correct landmark color, and it can produce a communication output at each time step which is observed by the listener. Thus, the speaker must learn to output the landmark color based on the motions of the listener. Although this problem is relatively simple, as we show in Section 5.2 it poses a significant challenge to traditional RL algorithms.

Cooperative navigation. In this environment, agents must cooperate through physical actions to reach a set of L landmarks. Agents observe the relative positions of other agents and landmarks, and are collectively rewarded based on the proximity of any agent to each landmark. In other words, the agents have to ‘cover’ all of the landmarks. Further, the agents occupy significant physical space and are penalized when colliding with each other. Our agents learn to infer the landmark they must cover, and move there while avoiding other agents.

Keep-away. This scenario consists of L landmarks including a target landmark, N cooperating agents who know the target landmark and are rewarded based on their distance to the target, and M adversarial agents who must prevent the cooperating agents from reaching the target. Adversaries accomplish this by physically pushing the agents away from the landmark, temporarily occupying it. While the adversaries are also rewarded based on their distance to the target landmark, they do not know the correct target; this must be inferred from the movements of the agents.
Figure 3: Comparison between MADDPG and DDPG (left), and between single policy MADDPG and ensemble MADDPG (right) on the competitive environments. Each bar cluster shows the 0-1 normalized score for a set of competing policies (agent v adversary), where a higher score is better for the agent. In all cases, MADDPG outperforms DDPG when directly pitted against it, and similarly for the ensemble against the single MADDPG policies. Full results are given in the Appendix.
Physical deception. Here, N agents cooperate to reach a single target landmark from a total of N landmarks. They are rewarded based on the minimum distance of any agent to the target (so only one agent needs to reach the target landmark). However, a lone adversary also desires to reach the target landmark; the catch is that the adversary does not know which of the landmarks is the correct one. Thus the cooperating agents, who are penalized based on the adversary distance to the target, learn to spread out and cover all landmarks so as to deceive the adversary.
Predator-prey. In this variant of the classic predator-prey game, N slower cooperating agents must chase the faster adversary around a randomly generated environment with L large landmarks impeding the way. Each time the cooperative agents collide with an adversary, the agents are rewarded while the adversary is penalized. Agents observe the relative positions and velocities of the agents, and the positions of the landmarks.
Covert communication. This is an adversarial communication environment, where a speaker agent (‘Alice’) must communicate a message to a listener agent (‘Bob’), who must reconstruct the message at the other end. However, an adversarial agent (‘Eve’) is also observing the channel, and wants to reconstruct the message — Alice and Bob are penalized based on Eve’s reconstruction, and thus Alice must encode her message using a randomly generated key, known only to Alice and Bob. This is similar to the cryptography environment considered in [2].
5.2 Comparison to Decentralized Reinforcement Learning Methods
We implement our MADDPG algorithm and evaluate it on the environments presented in Section 5.1. Unless otherwise specified, our policies are parameterized by a two-layer ReLU MLP with 64 units per layer. The messages sent between agents are soft approximations to discrete messages, calculated using the Gumbel-Softmax estimator [14]. To evaluate the quality of policies learned in competitive settings, we pit MADDPG agents against DDPG agents, and compare the resulting success of the agents and adversaries in the environment. We train our models until convergence, and then evaluate them by averaging various metrics for 1000 further iterations. We provide the tables and details of our results on all environments in the Appendix, and summarize them here.
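The Gumbel-Softmax relaxation mentioned above can be sketched in a few lines; the sketch below is illustrative, and the temperature value is an assumed hyperparameter rather than one reported here.

```python
import torch
import torch.nn.functional as F

def gumbel_softmax_sample(logits, temperature=1.0):
    u = torch.rand_like(logits)
    gumbel = -torch.log(-torch.log(u + 1e-20) + 1e-20)          # Gumbel(0, 1) noise
    return F.softmax((logits + gumbel) / temperature, dim=-1)   # soft one-hot message
```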
We first examine the cooperative communication scenario. Despite the simplicity of the task (the speaker only needs to learn to output its observation), traditional RL methods such as DQN, Actor Critic, a first-order implementation of TRPO, and DDPG all fail to learn the correct behaviour (measured by whether the listener is within a short distance from the target landmark). In practice we observed that the listener learns to ignore the speaker and simply moves to the middle of all observed landmarks. We plot the learning curves over 25000 episodes for various approaches in Figure 4.
Figure 5: Comparison between MADDPG (left) and DDPG (right) on the cooperative communication (CC) and physical deception (PD) environments at t = 0, 5, and 25. Small dark circles indicate landmarks. In CC, the grey agent is the speaker, and the color of the listener indicates the target landmark. In PD, the blue agents are trying to deceive the red adversary, while covering the target landmark (in green). MADDPG learns the correct behavior in both cases: in CC the speaker learns to output the target landmark color to direct the listener, while in PD the agents learn to cover both landmarks to confuse the adversary. DDPG (and other RL algorithms) struggles in these settings: in CC the speaker always repeats the same utterance and the listener moves to the middle of the landmarks, and in PD one agent greedily pursues the green landmark (and is followed by the adversary) while the other agent scatters. See video for full trajectories.
We hypothesize that a primary reason for the failure of traditional RL methods in this (and other) multi-agent settings is the lack of a consistent gradient signal. For example, if the speaker utters the correct symbol while the listener moves in the wrong direction, the speaker is penalized. This problem is exacerbated as the number of time steps grows: we observed that traditional policy gradient methods can learn when the objective of the listener is simply to reconstruct the observation of the speaker in a single time step, or if the initial positions of agents and landmarks are fixed and evenly distributed. This indicates that many of the multi-agent methods previously proposed for scenarios with short time horizons (e.g. [16]) may not generalize to more complex tasks.
Figure 6: Policy learning success rate on cooperative communication after 25000 episodes.
Conversely, MADDPG agents can learn coordinated behavior more easily via the centralized critic. In the cooperative communication environment, MADDPG is able to reliably learn the correct listener and speaker policies, and the listener is often (84.0% of the time) able to navigate to the target. A similar situation arises for the physical deception task: when the cooperating agents are trained with MADDPG, they are able to successfully deceive the adversary by covering all of the landmarks around 94% of the time when L = 2 (Figure 5). Furthermore, the adversary success is quite low, especially when the adversary is trained with DDPG (16.4% when L = 2). This contrasts sharply with the behaviour learned by the cooperating DDPG agents, who are unable to deceive MADDPG adversaries in any scenario, and do not even deceive other DDPG agents when L = 4.
While the cooperative navigation and predator-prey tasks have a less stark divide between success and failure, in both cases the MADDPG agents outperform the DDPG agents. In cooperative navigation, MADDPG agents have a slightly smaller average distance to each landmark, but have almost half the average number of collisions per episode (when N = 2) compared to DDPG agents due to the ease of coordination. Similarly, MADDPG predators are far more successful at chasing DDPG prey (16.1 collisions/episode) than the converse (10.3 collisions/episode).
Figure 7: Effectiveness of learning by approximating policies of other agents in the cooperative communication scenario. Left: plot of the reward over number of iterations; MADDPG agents quickly learn to solve the task when approximating the policies of others. Right: KL divergence between the approximate policies and the true policies.
In the covert communication environment, we found that Bob trained with both MADDPG and DDPG out-performs Eve in terms of reconstructing Alice’s message. However, Bob trained with MADDPG achieves a larger relative success rate compared with DDPG (52.4% to 25.1%). Further, only Alice trained with MADDPG can encode her message such that Eve achieves near-random reconstruction accuracy. The learning curve (a sample plot is shown in the Appendix) shows that the oscillation due to the competitive nature of the environment often cannot be overcome with common decentralized RL methods. We emphasize that we do not use any of the tricks required for the cryptography environment from [2], including modifying Eve’s loss function, alternating agent and adversary training, and using a hybrid ‘mix & transform’ feed-forward and convolutional architecture.
5.3 Effect of Learning Policies of Other Agents
We evaluate the effectiveness of learning the policies of other agents in the cooperative communication environment, following the same hyperparameters as the previous experiments and setting λ = 0.001 in Eq. 7. The results are shown in Figure 7. We observe that despite not fitting the policies of other agents perfectly (in particular, the approximate listener policy learned by the speaker has a fairly large KL divergence to the true policy), learning with approximated policies is able to achieve the same success rate as using the true policy, without a significant slowdown in convergence.
5.4 Effect of Training with Policy Ensembles
We focus on the effectiveness of policy ensembles in competitive environments, including keep-away, cooperative navigation, and predator-prey. We choose K = 3 sub-policies for the keep-away and cooperative navigation environments, and K = 2 for predator-prey. To improve convergence speed, we enforce that the cooperative agents should have the same policies at each episode, and similarly for the adversaries. To evaluate the approach, we measure the performance of ensemble policies and single policies in the roles of both agent and adversary. The results are shown on the right side of Figure 3. We observe that agents with policy ensembles are stronger than those with a single policy. In particular, when pitting ensemble agents against single policy adversaries (second to left bar cluster), the ensemble agents outperform the adversaries by a large margin compared to when the roles are reversed (third to left bar cluster).
6 Conclusions and Future Work
We have proposed a multi-agent policy gradient algorithm where agents learn a centralized critic based on the observations and actions of all agents. Empirically, our method outperforms traditional RL algorithms on a variety of cooperative and competitive multi-agent environments. We can further improve the performance of our method by training agents with an ensemble of policies, an approach we believe to be generally applicable to any multi-agent algorithm.
One downside to our approach is that the input space of Q grows linearly (depending on what information is contained in x) with the number of agents N. This could be remedied in practice by, for example, having a modular Q function that only considers agents in a certain neighborhood of a given agent. We leave this investigation to future work.
Acknowledgements
The authors would like to thank Jacob Andreas, Smitha Milli, Jack Clark, Jakob Foerster, and others at OpenAI and UC Berkeley for interesting discussions related to this paper, as well as Jakub Pachocki, Yura Burda, and Joelle Pineau for comments on the paper draft. We thank Tambet Matiisen for providing the code base that was used for some early experiments associated with this paper. Ryan Lowe is supported in part by a Vanier CGS Scholarship and the Samsung Advanced Institute of Technology. Finally, we’d like to thank OpenAI for fostering an engaging and productive research environment.
References
[1] DeepMind AI reduces Google data centre cooling bill by 40%. https://deepmind.com/blog/deepmind-ai-reduces-google-data-centre-cooling-bill-40/. Accessed: 2017-05-19.
[2] M. Abadi and D. G. Andersen. Learning to protect communications with adversarial neural cryptography. arXiv preprint arXiv:1610.06918, 2016.
[3] C. Boutilier. Learning conventions in multiagent stochastic domains using likelihood estimates. In Proceedings of the Twelfth International Conference on Uncertainty in Artificial Intelligence, pages 106–114. Morgan Kaufmann Publishers Inc., 1996.
[4] L. Busoniu, R. Babuska, and B. De Schutter. A comprehensive survey of multiagent reinforcement learning. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, 38(2):156, 2008.
[5] G. Chalkiadakis and C. Boutilier. Coordination in multiagent reinforcement learning: a Bayesian approach. In Proceedings of the Second International Joint Conference on Autonomous Agents and Multiagent Systems, pages 709–716. ACM, 2003.
[6] P. Dayan and G. E. Hinton. Feudal reinforcement learning. In Advances in Neural Information Processing Systems, pages 271–271. Morgan Kaufmann Publishers, 1993.
[7] J. Foerster, G. Farquhar, T. Afouras, N. Nardelli, and S. Whiteson. Counterfactual multi-agent policy gradients. arXiv preprint arXiv:1705.08926, 2017.
[8] J. N. Foerster, Y. M. Assael, N. de Freitas, and S. Whiteson. Learning to communicate with deep multi-agent reinforcement learning. CoRR, abs/1605.06676, 2016.
[9] J. N. Foerster, N. Nardelli, G. Farquhar, P. H. S. Torr, P. Kohli, and S. Whiteson. Stabilising experience replay for deep multi-agent reinforcement learning. CoRR, abs/1702.08887, 2017.
[10] M. C. Frank and N. D. Goodman. Predicting pragmatic reasoning in language games. Science, 336(6084):998–998, 2012.
[11] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
[12] J. K. Gupta, M. Egorov, and M. Kochenderfer. Cooperative multi-agent control using deep reinforcement learning. 2017.
[13] J. Hu and M. P. Wellman. Online learning about other agents in a dynamic multiagent system. In Proceedings of the Second International Conference on Autonomous Agents, AGENTS ’98, pages 239–246, New York, NY, USA, 1998. ACM.
[14] E. Jang, S. Gu, and B. Poole. Categorical reparameterization with Gumbel-Softmax. arXiv preprint arXiv:1611.01144, 2016.
[15] M. Lauer and M. Riedmiller. An algorithm for distributed reinforcement learning in cooperative multi-agent systems. In Proceedings of the Seventeenth International Conference on Machine Learning, pages 535–542. Morgan Kaufmann, 2000.
[16] A. Lazaridou, A. Peysakhovich, and M. Baroni. Multi-agent cooperation and the emergence of (natural) language. arXiv preprint arXiv:1612.07182, 2016.
[17] J. Z. Leibo, V. F. Zambaldi, M. Lanctot, J. Marecki, and T. Graepel. Multi-agent reinforcement learning in sequential social dilemmas. CoRR, abs/1702.03037, 2017.
[18] S. Levine, C. Finn, T. Darrell, and P. Abbeel. End-to-end training of deep visuomotor policies. arXiv preprint arXiv:1504.00702, 2015.
[19] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
[20] M. L. Littman. Markov games as a framework for multi-agent reinforcement learning. In Proceedings of the Eleventh International Conference on Machine Learning, volume 157, pages 157–163, 1994.
[21] L. Matignon, L. Jeanpierre, A.-I. Mouaddib, et al. Coordinated multi-robot exploration under communication constraints using decentralized Markov decision processes. In AAAI, 2012.
[22] L. Matignon, G. J. Laurent, and N. Le Fort-Piat. Hysteretic Q-learning: an algorithm for decentralized reinforcement learning in cooperative multi-agent teams. In Intelligent Robots and Systems, 2007 (IROS 2007), IEEE/RSJ International Conference on, pages 64–69. IEEE, 2007.
[23] L. Matignon, G. J. Laurent, and N. Le Fort-Piat. Independent reinforcement learners in cooperative Markov games: a survey regarding coordination problems. The Knowledge Engineering Review, 27(01):1–31, 2012.
[24] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
[25] I. Mordatch and P. Abbeel. Emergence of grounded compositional language in multi-agent populations. arXiv preprint arXiv:1703.04908, 2017.
[26] S. Omidshafiei, J. Pazis, C. Amato, J. P. How, and J. Vian. Deep decentralized multi-task multi-agent reinforcement learning under partial observability. CoRR, abs/1703.06182, 2017.
[27] L. Panait and S. Luke. Cooperative multi-agent learning: The state of the art. Autonomous Agents and Multi-Agent Systems, 11(3):387–434, Nov. 2005.
[28] P. Peng, Q. Yuan, Y. Wen, Y. Yang, Z. Tang, H. Long, and J. Wang. Multiagent bidirectionally-coordinated nets for learning to play StarCraft combat games. CoRR, abs/1703.10069, 2017.
[29] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.
[30] D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller. Deterministic policy gradient algorithms. In Proceedings of the 31st International Conference on Machine Learning, pages 387–395, 2014.
[31] S. Sukhbaatar, R. Fergus, et al. Learning multiagent communication with backpropagation. In Advances in Neural Information Processing Systems, pages 2244–2252, 2016.
[32] S. Sukhbaatar, I. Kostrikov, A. Szlam, and R. Fergus. Intrinsic motivation and automatic curricula via asymmetric self-play. arXiv preprint arXiv:1703.05407, 2017.
[33] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction, volume 1. MIT Press, Cambridge, 1998.
[34] R. S. Sutton, D. A. McAllester, S. P. Singh, and Y. Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, pages 1057–1063, 2000.
[35] A. Tampuu, T. Matiisen, D. Kodelja, I. Kuzovkin, K. Korjus, J. Aru, J. Aru, and R. Vicente. Multiagent cooperation and competition with deep reinforcement learning. PLoS ONE, 12(4):e0172395, 2017.
[36] M. Tan. Multi-agent reinforcement learning: Independent vs. cooperative agents. In Proceedings of the Tenth International Conference on Machine Learning, pages 330–337, 1993.
[37] G. Tesauro. Extending Q-learning to general adaptive multi-agent systems. In Advances in Neural Information Processing Systems, pages 871–878, 2004.
[38] P. S. Thomas and A. G. Barto. Conjugate Markov decision processes. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 137–144, 2011.
[39] R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256, 1992.
Experimental Results
In all of our experiments, we use the Adam optimizer with a learning rate of 0.01 and τ = 0.01 for updating the target networks. γ is set to 0.95. The size of the replay buffer is 10^6, and we update the network parameters after every 100 samples added to the replay buffer. We use a batch size of 1024 episodes before making an update, except for TRPO, where we found a batch size of 50 led to better performance (allowing it more updates relative to MADDPG). We train with 10 random seeds for environments with stark success/fail conditions (cooperative communication, physical deception, and covert communication) and 3 random seeds for the other environments. The details of the experimental results are shown in the following tables.
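For convenience, the training settings listed above can be collected into a single configuration; the dictionary below is an illustrative summary of the stated values, not an artifact of the original code.

```python
# Hyperparameters as stated in the text above (illustrative summary only).
CONFIG = {
    "optimizer": "Adam",
    "learning_rate": 0.01,
    "tau": 0.01,                    # target-network soft-update rate
    "gamma": 0.95,                  # discount factor
    "replay_buffer_size": int(1e6),
    "update_every_samples": 100,    # update after every 100 samples added to the buffer
    "batch_size_episodes": 1024,    # 50 for TRPO, per the text
}
```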
Table 1: Percentage of episodes where the agent reached the target landmark and average distance from the target in the cooperative communication environment, after 25000 episodes. Note that the percentage of targets reached is different than the policy learning success rate in Figure 6, which indicates the percentage of runs in which the correct policy was learned (consistently reaching the target landmark). Even when the correct behavior is learned, agents occasionally hover slightly outside the target landmark on some episodes, and conversely agents who learn to go to the middle of the landmarks occasionally stumble upon the correct landmark.
Table 2: Average # of collisions per episode and average agent distance from a landmark in the cooperative navigation task, using 2-layer 128 unit MLP policies.
Table 3: Average number of prey touches by predator per episode on two predator-prey environments with N = L = 3, one where the prey (adversaries) are slightly (30%) faster (PP1), and one where they are significantly (100%) faster (PP2). All policies in this experiment are 2-layer 128 unit MLPs.
Table 4: Results on the physical deception task, with N = 2 and 4 cooperative agents/landmarks. Success (succ %) for agents (AG) and adversaries (ADV) is if they are within a small distance from the target landmark.
Table 5: Agent (Bob) and adversary (Eve) success rate (succ %, i.e. correctly reconstructing the speaker’s message) in the covert communication environment. The input message is drawn from a set of two 4-dimensional one-hot vectors.
Variance of Policy Gradient Algorithms in a Simple Multi-Agent Setting
To analyze the variance of policy gradient methods in multi-agent settings, we consider a simple cooperative scenario with N agents and binary actions: P(a_i = 1) = θ_i. We define the reward to be 1 if all actions are the same (a_1 = a_2 = ... = a_N), and 0 otherwise. This is a simple scenario with no temporal component: agents must simply learn to either always output 1 or always output 0 at each time step. Despite this, we can show that the probability of taking a gradient step in the correct direction decreases exponentially with the number of agents N.
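A small Monte Carlo sketch of a simplified version of this setting (assuming, for illustration, that only the all-ones joint action is rewarded and that θ_i = 0.5 for all agents) illustrates the claim: the fraction of single-sample REINFORCE estimates for one agent that point in the correct (positive) direction shrinks roughly as 0.5^N.

```python
import numpy as np

def prob_correct_direction(n_agents, n_samples=1_000_000, theta=0.5, seed=0):
    rng = np.random.default_rng(seed)
    a = rng.random((n_samples, n_agents)) < theta                 # sampled binary actions a_i
    r = np.all(a, axis=1)                                          # reward 1 iff every a_i = 1 (simplified)
    # Single-sample REINFORCE estimate for agent 0: r * d/dtheta log P(a_0; theta)
    grad0 = r * np.where(a[:, 0], 1.0 / theta, -1.0 / (1.0 - theta))
    return float(np.mean(grad0 > 0))                               # fraction of steps in the correct direction

for n in (2, 4, 8, 16):
    print(n, prob_correct_direction(n))                           # decays roughly as 0.5**n
```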