Monte Carlo Tree Search
The process of Upper Confidence Bounds for Trees (UCT)
1. Selection
Starting at the root node R, search a node L which is not expanded fully.
Specifically, we use UCB(upper confidence bound) algorithm to choose the descendant node with the most potential.
v is the parent node of the v’.
Q(v’) represent the value of v’, which is ruled by yourself.
N(v’) represent the visit times of v’. N(v) represent the visit times of the parent node of v’.
Cp is the constant that trade off exploration and exploitation.
From the root R:
- If all child nodes of R are visited, we choose the child node which has the maximum UCB value.
- Otherwise, we randomly choose a child node which has not been visited.
2. Expansion
After selection, we choose the node L. If L isn’t a terminal node(it means that the game is not over yet), randomly create a node as its child node.
The node which is created represent the next step of the game.
If L is a terminal node, then return the current node L. Because it is already a exit status.
3. Simulation
Get node c from the above two steps. And starting from node c, simulate the game until the end of the game.
We can randomly play until we can decide the winner.
4.Backpropagation
In the step 3, we get a result from the simulation of the game. And we must use the result to update the node.
From the node c to the root R, we update all the node as follows:
N(v) = N(v) + 1
Q(v) = Q(v) + reward
The reason we use MCTS
Monte carlo tree search is based on sampling to get results, rather than exhaustive enumeration.
Although we can’t enumerate all the game cases, the results is still fairly reliable.
Pseudocode
Code
to be continue…