Multiagent Reinforcement Learning: Learning to Cooperate, Compete, and Communicate
Multiagent environments where agents compete for resources are stepping stones on the path to AGI. Multiagent environments have two useful properties: first, there is a natural curriculum—the difficulty of the environment is determined by the skill of your competitors (and if you’re competing against clones of yourself, the environment exactly matches your skill level). Second, a multiagent environment has no stable equilibrium: no matter how smart an agent is, there’s always pressure to get smarter. These environments have a very different feel from traditional environments, and it’ll take a lot more research before we become good at them.
We’ve developed a new algorithm, MADDPG, for centralized learning and decentralized execution in multiagent environments, allowing agents to learn to collaborate and compete with each other.
MADDPG is used to train four red agents to chase two green agents. The red agents have learned to team up with one another to chase a single green agent, gaining higher reward. The green agents, meanwhile, have learned to split up: while one is being chased, the other tries to reach the water (blue circle) while avoiding the red agents.
MADDPG extends a reinforcement learning algorithm called DDPG, taking inspiration from actor-critic reinforcement learning techniques; other groups are exploring variations and parallel implementations of these ideas.
We treat each agent in our simulation as an “actor”, and each actor gets advice from a “critic” that helps the actor decide what actions to reinforce during training. Traditionally, the critic tries to predict the value (i.e. the reward we expect to get in the future) of an action in a particular state, which the agent (the actor) then uses to update its policy. This is more stable than directly using the reward, which can vary considerably. To make it feasible to train multiple agents that can act in a globally-coordinated way, we enhance our critics so they can access the observations and actions of all the agents, as the following diagram shows.
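To make the architecture concrete, here is a minimal PyTorch sketch of the two networks described above: a decentralized actor that sees only its own observation, and a centralized critic that sees the observations and actions of every agent. The class names, layer sizes, and interfaces are illustrative assumptions, not code from our release.

```python
import torch
import torch.nn as nn


class Actor(nn.Module):
    """Decentralized actor: maps one agent's own observation to its action."""

    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim), nn.Tanh(),  # continuous actions in [-1, 1]
        )

    def forward(self, obs):
        return self.net(obs)


class CentralizedCritic(nn.Module):
    """Critic that sees the observations and actions of all agents."""

    def __init__(self, obs_dims, act_dims, hidden=64):
        super().__init__()
        joint_dim = sum(obs_dims) + sum(act_dims)
        self.net = nn.Sequential(
            nn.Linear(joint_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),  # scalar Q-value for one agent
        )

    def forward(self, all_obs, all_actions):
        # all_obs / all_actions are lists of per-agent tensors; the critic
        # conditions on the concatenated joint observation-action vector.
        joint = torch.cat(list(all_obs) + list(all_actions), dim=-1)
        return self.net(joint)
```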
Our agents don’t need to access the central critic at test time; they act based on their own observations in combination with their predictions of other agents’ behaviors. Since a centralized critic is learned independently for each agent, our approach can also be used to model arbitrary reward structures between agents, including adversarial cases where the rewards are opposing.
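As a rough illustration of this decentralized execution, the loop below acts with the trained actors alone; the centralized critics are never consulted. It assumes a simple multi-agent, Gym-style environment interface (`reset` returning one observation per agent, `step` taking one action per agent), which is an assumption for illustration rather than a specific API.

```python
import torch


def run_episode(env, actors, max_steps=100):
    """Roll out one episode using only each agent's local observation."""
    observations = env.reset()  # assumed: one observation per agent
    for _ in range(max_steps):
        with torch.no_grad():
            actions = [
                actor(torch.as_tensor(obs, dtype=torch.float32))
                for actor, obs in zip(actors, observations)
            ]
        observations, rewards, done, _ = env.step([a.numpy() for a in actions])
        if done:
            break
```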
We tested our approach on a variety of tasks and it performed better than DDPG on all of them. In the above animations you can see, from left to right: two AI agents trying to go to a specific location and learning to split up to hide their intended location from the opposing agent; one agent communicating the name of a landmark to another agent; and three agents coordinating to travel to landmarks without bumping into each other.
Red agents trained with MADDPG exhibit more complex behaviors than those trained with DDPG. In the above animation we see agents trained with our technique (left) and DDPG (right) attempting to chase green agents through green forests and around black obstacles. Our agents catch more green agents and coordinate with one another more visibly than those trained with DDPG.
Where Traditional RL Struggles
Traditional decentralized RL approaches — DDPG, actor-critic learning, deep Q-learning, and so on — struggle to learn in multiagent environments, as at every time step each agent will be trying to learn to predict the actions of other agents while also taking its own actions. This is especially true in competitive situations. MADDPG employs a centralized critic to supply agents with information about their peers’ observations and potential actions, transforming an unpredictable environment into a predictable one.
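The sketch below shows one MADDPG-style update for a single agent i, following the idea above: agent i’s critic is trained against a one-step TD target computed from every agent’s (target) policy, and agent i’s actor is then pushed toward actions the centralized critic scores highly. Replay-buffer sampling, exploration noise, and target-network updates are assumed to happen elsewhere; the function and variable names are illustrative, not from our released code.

```python
import torch
import torch.nn.functional as F


def update_agent(i, batch, actors, critics, target_actors, target_critics,
                 actor_opts, critic_opts, gamma=0.95):
    """One MADDPG-style update for agent i (illustrative sketch)."""
    obs, actions, rewards, next_obs = batch  # lists of per-agent tensors; rewards[i] shaped [batch, 1]

    # Critic update: regress agent i's centralized Q-value toward a one-step
    # TD target that uses the target actors of *all* agents for next actions.
    with torch.no_grad():
        next_actions = [ta(o) for ta, o in zip(target_actors, next_obs)]
        target_q = rewards[i] + gamma * target_critics[i](next_obs, next_actions)
    critic_loss = F.mse_loss(critics[i](obs, actions), target_q)
    critic_opts[i].zero_grad()
    critic_loss.backward()
    critic_opts[i].step()

    # Actor update: ascend the centralized Q-value with respect to agent i's
    # own action, holding the other agents' sampled actions fixed.
    joint_actions = [a.detach() for a in actions]
    joint_actions[i] = actors[i](obs[i])
    actor_loss = -critics[i](obs, joint_actions).mean()
    actor_opts[i].zero_grad()
    actor_loss.backward()
    actor_opts[i].step()
```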
Using policy gradient methods presents further challenges: because they exhibit high variance, learning the right policy is difficult when the reward is inconsistent. We also found that adding in a critic, while improving stability, still failed to solve several of our environments, such as cooperative communication. It seems that considering the actions of others during training is important for learning collaborative strategies.
Initial Research
Before we developed MADDPG, when using decentralized techniques, we noticed that listener agents would often learn to ignore the speaker if it sent inconsistent messages about where to go. The agent would then set all the weights associated with the speaker’s message to 0, effectively deafening itself. Once this happens, it’s hard for training to recover, since the speaker never gets any feedback and so can never learn whether it is saying the right thing. To fix this, we looked at a technique outlined in a recent hierarchical reinforcement learning project, which lets us force the listener to incorporate the utterances of the speaker in its decision-making process. This fix didn’t work: though it forces the listener to pay attention to the speaker, it doesn’t help the speaker figure out what is relevant to say. Our centralized critic method helps deal with these challenges by helping the speaker learn which utterances might be relevant to the actions of other agents. For more of our results, you can watch the following video:
Next Steps
Agent modeling has a rich history within artificial intelligence research, and many of these scenarios have been studied before. Lots of previous research considered games with only a small number of time steps and a small state space. Deep learning lets us deal with complex visual inputs, and RL gives us tools to learn behaviors over long time periods. Now that we can use these capabilities to train multiple agents at once without them needing to know the dynamics of the environment (how the environment changes at each time step), we can tackle a wider range of problems involving communication and language while learning from environments’ high-dimensional information. If you’re interested in exploring different approaches to evolving agents, then consider joining OpenAI.