What Is Reinforcement Learning?
Machine learning is one of the most discussed fields in the IT world. With the spread of image and speech recognition, self-driving vehicles, product recommendations, and fraud detection, machine learning is everywhere.
One subfield of machine learning focuses on discovering solutions to problems through self-learning. Video games are a good example. When a small group of researchers from a company called DeepMind published a paper on playing Atari games with reinforcement learning, the field received a lot of attention. So much, in fact, that Google later purchased the company for a large sum of money.
When AlphaGo beat the world champion at the game of Go, the power of reinforcement learning could no longer go unnoticed. After reading this article, you will have a basic understanding of the ideas behind reinforcement learning and its applications.
Reinforcement Learning
Reinforcement learning is the training of machine learning models to make a sequence of decisions for a given scenario.
At its core is an autonomous agent, such as a person, a robot, or a deep neural network, learning to navigate an uncertain environment. The goal of this agent is to maximize a numerical reward.
Sports are a great example of this. Let's consider what our agent will have to deal with in a tennis match.
The agent has to choose its actions, such as serves and volleys. These actions change the state of the game, that is, the current score in the set and which player is leading.
Every action is performed with a reward in mind. The agent has to win points in order to win a game, a set, and the match.
Our player needs to follow certain rules and strategies to maximize the final score.
A model built on this idea takes a state and an action as input and maps them to the maximum reward it expects to obtain. The model also has to think ahead and consider the long-term results of its actions.
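One common way to turn this description into code is a table of action values updated with the Q-learning rule. The article does not name a specific algorithm, so the sketch below is only an illustration: the function names, the discount factor `gamma`, the learning rate `alpha`, and the tennis-flavored state labels in the usage note are all assumptions.

```python
def discounted_return(rewards, gamma=0.9):
    """Sum a sequence of rewards, weighting later rewards less (gamma is assumed)."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


def q_update(q, state, action, reward, next_state, actions, alpha=0.5, gamma=0.9):
    """Apply one tabular Q-learning update to the dict q and return the new value.

    q maps (state, action) pairs to the reward the model currently expects.
    Unseen pairs are treated as 0.0.
    """
    # "Thinking ahead": the best value reachable from the next state.
    best_next = max(q.get((next_state, a), 0.0) for a in actions)
    target = reward + gamma * best_next
    old = q.get((state, action), 0.0)
    q[(state, action)] = old + alpha * (target - old)
    return q[(state, action)]
```

For instance, starting from an empty table, `q_update({}, "deuce", "serve", 1.0, "advantage", ["serve", "volley"])` moves the value of serving at deuce halfway from 0 toward the observed reward of 1.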
This process varies from task to task, which is not surprising. Building a model that plays tennis is quite different from building one that plays Atari games.
Supervised Learning vs. Reinforcement Learning
Reinforcement learning is not just a fancy name for supervised learning. Supervised learning focuses on making sense of the environment based on historical examples.
However, this is not always appropriate. A common example would be driving in traffic.
Imagine doing this based on observations made the day before, when there were barely any cars on the road. That is about as effective as driving while looking only at the rearview mirrors.
Reinforcement learning is all about collecting rewards. The agent focuses on making proper turns, signaling when necessary, and staying within the speed limit. It can also lose points for dangerous actions, such as speeding.
The goal is to maximize the number of points given the current state of traffic. The emphasis here is that an action causes a change in state, which is something supervised learning does not focus on.
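The idea that an action changes the state and earns or loses points can be sketched as a tiny step function. This is a toy illustration, not a real driving simulator: the speed limit, the three action names, and the point values are all made up for the example.

```python
SPEED_LIMIT = 50  # km/h, an assumed value for this toy example


def step(speed, action):
    """Apply an action to the current state and return (next_speed, reward)."""
    # Hypothetical actions and their effect on the state (the car's speed).
    delta = {"accelerate": 10, "hold": 0, "brake": -10}[action]
    next_speed = max(0, speed + delta)
    # One point for staying within the limit, a penalty for speeding.
    reward = 1 if next_speed <= SPEED_LIMIT else -5
    return next_speed, reward


# Run a short sequence of actions and collect the points.
speed, total = 40, 0
for action in ["accelerate", "accelerate", "brake"]:
    speed, reward = step(speed, action)
    total += reward
```

Note that the same action, "accelerate", is rewarded at 40 km/h and penalized at 50 km/h. The value of an action depends on the state it is taken in, and the action itself produces the next state.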
Exploration vs. Exploitation
Let's say we place a brand-new robot in a room. His goal is to find bolts and screws. The robot is the agent and the room is the environment. He has four possible actions: moving left, right, forward, or backward.
The robot’s state consists of his current location and his previous location. When he moves, the state changes, but we do not yet know whether the move was right or wrong.
So we let our robot explore the room. After walking around for a while, our robot finally finds some screws. The bot receives a reward for doing the right thing.
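The room, the four actions, and the reward described above can be sketched as a small grid world. The grid size, the coordinates of the screws, and the reward of 1 are assumptions chosen for illustration.

```python
ROOM_SIZE = 5    # assumed: the room is a 5 x 5 grid
SCREWS = (4, 4)  # assumed location of the screws

# The four actions from the text, as (dx, dy) offsets on the grid.
ACTIONS = {
    "left": (-1, 0),
    "right": (1, 0),
    "forward": (0, 1),
    "backward": (0, -1),
}


def move(state, action):
    """Return (new_state, reward) for a state of (current, previous) locations."""
    current, _previous = state
    dx, dy = ACTIONS[action]
    # Walls: the robot stays inside the room.
    x = min(max(current[0] + dx, 0), ROOM_SIZE - 1)
    y = min(max(current[1] + dy, 0), ROOM_SIZE - 1)
    new_state = ((x, y), current)
    # The robot only learns a move was good once it reaches the screws.
    reward = 1 if (x, y) == SCREWS else 0
    return new_state, reward
```

Most moves return a reward of 0, which mirrors the point in the text: after a single move, the robot cannot yet tell whether it was right or wrong.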
Now, the policy our robot follows is to maximize his points. He knows that by taking the same route to the screws again, he will definitely reach the goal and receive some positive feedback.
However, this path is far from optimal, and any robot could simply follow the same path again and again. This is where the tradeoff between exploration and exploitation comes in.
If our robot follows the same path, he exploits what he has already learned and eventually reaches the goal. If we instead let the robot wander around looking for better paths, he is exploring.
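A widely used way to balance the two behaviors is the epsilon-greedy rule: with a small probability the agent picks a random action (exploration), and otherwise it picks the action it currently believes is best (exploitation). The article does not prescribe this rule, so treat the sketch below, including the function name and the example values, as one possible illustration.

```python
import random


def epsilon_greedy(action_values, epsilon, rng=random):
    """Pick an action name from a dict of estimated values.

    With probability epsilon a random action is returned (explore),
    otherwise the action with the highest estimate is returned (exploit).
    """
    if rng.random() < epsilon:
        return rng.choice(list(action_values))
    return max(action_values, key=action_values.get)


# Hypothetical value estimates for the robot's four moves.
estimates = {"left": 0.1, "right": 0.7, "forward": 0.4, "backward": 0.0}
greedy_choice = epsilon_greedy(estimates, epsilon=0.0)   # always exploits
random_choice = epsilon_greedy(estimates, epsilon=1.0)   # always explores
```

Setting `epsilon` to 0 reproduces the robot that always repeats the known path, while values closer to 1 make him wander more.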
Let’s say we give our robot 10 rounds to learn about the environment and find better routes. After some time, he finds a path that takes only half the number of steps needed previously. In the process, however, he has taken some bad paths as well.
In this case, if we let another robot run the same 10 rounds using only the first path he found, that robot might end up collecting more bolts and screws.
This is the tradeoff researchers have to deal with when developing reinforcement learning models.
Real-Life Examples
Reinforcement learning models have to be well trained and optimized to navigate real-life situations. The scenarios and the environment around the agent can change at any time.
For example, suppose we are inside a self-driving vehicle that should be optimized for safety. If the brake lights of the car in front of us come on, it is probably time to slow down. If there is a massive rock on the road, however, we expect the car to stop.
Another impressive project aimed to build prosthetic legs that can recognize walking patterns and adjust accordingly. As a prototype, the researchers developed a virtual runner that had to learn on its own.
We have learned that reinforcement learning is a powerful tool with high potential. The catch is that it needs a lot of data and training to be able to navigate real-world situations.
There have been some promising and impressive results in the field of large-scale computing. These systems can explore massive environments with a huge number of states, such as those found in large-scale video games.
Final Thoughts
Reinforcement learning is simple and powerful. Given its recent development and rapid progress, it has the potential to become a major force in the field of deep learning. However, it is important to remember that reinforcement learning is only one of many existing approaches to solving real-world problems, and it has its own advantages and disadvantages.
Reinforcement learning has the potential to make machines creative, as we have seen with AlphaGo. Getting started in this field can be as simple as taking a short course or developing a small program, such as one for the robot searching for bolts.
Translated from: https://towardsdatascience.com/what-is-reinforcement-learning-99f9615918e3