Reinforcement Learning Exercise 4.1

Example 4.1 Consider the $4 \times 4$ gridworld shown below.
[Figure: the 4×4 gridworld of Example 4.1]
The nonterminal states are $\mathcal S = \{1, 2, \ldots, 14\}$. There are four actions possible in each state, $\mathcal A = \{\text{up}, \text{down}, \text{right}, \text{left}\}$, which deterministically cause the corresponding state transitions, except that actions that would take the agent off the grid in fact leave the state unchanged. Thus, for instance, $p(6, -1 \mid 5, \text{right}) = 1$, $p(7, -1 \mid 7, \text{right}) = 1$, and $p(10, r \mid 5, \text{right}) = 0$ for all $r \in \mathcal R$. This is an undiscounted, episodic task. The reward is $-1$ on all transitions until the terminal state is reached. The terminal state is shaded in the figure (although it is shown in two places, it is formally one state). The expected reward function is thus $r(s, a, s') = -1$ for all states $s$, $s'$ and actions $a$. Suppose the agent follows the equiprobable random policy (all actions equally likely). The left side of Figure 4.1 shows the sequence of value functions $\{v_k\}$ computed by iterative policy evaluation. The final estimate is in fact $v_\pi$, which in this case gives for each state the negation of the expected number of steps from that state until termination.
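Since everything below rests on the converged $v_\pi$, a minimal sketch of iterative policy evaluation may help make the example concrete. It numbers the cells 0–15 in row-major order, with indices 0 and 15 both standing for the single terminal state; this numbering and the helper names `step` and `policy_evaluation` are choices of this sketch, not from the book.

```python
import numpy as np

N = 4
TERMINAL = {0, 15}  # both shaded corners are formally one terminal state
MOVES = {"up": -N, "down": N, "left": -1, "right": 1}

def step(s, a):
    """Deterministic transition; moves that would leave the grid keep s."""
    row, col = divmod(s, N)
    if (a == "up" and row == 0) or (a == "down" and row == N - 1) \
            or (a == "left" and col == 0) or (a == "right" and col == N - 1):
        return s
    return s + MOVES[a]

def policy_evaluation(gamma=1.0, theta=1e-6):
    """In-place iterative policy evaluation of the equiprobable random policy."""
    v = np.zeros(N * N)
    while True:
        delta = 0.0
        for s in range(N * N):
            if s in TERMINAL:
                continue
            # Each action has probability 1/4; the reward is -1 on every transition.
            new_v = sum(0.25 * (-1.0 + gamma * v[step(s, a)]) for a in MOVES)
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < theta:
            return v

if __name__ == "__main__":
    # Converges to the final values of Figure 4.1, e.g. v_pi(11) = -14.
    print(policy_evaluation().reshape(N, N).round(1))
```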

Exercise 4.1 In Example 4.1, if $\pi$ is the equiprobable random policy, what is $q_\pi(11, \text{down})$? What is $q_\pi(7, \text{down})$?
Here we can use the result of Exercise 3.13, which expresses $q_\pi$ in terms of $v_\pi$:
$$
q_\pi(s,a) = \sum_{s'} P_{s,s'}^a \Bigl[ R_{s,s'}^a + \gamma\, v_\pi(s') \Bigr]
$$
where $P_{s,s'}^a = p(s' \mid s, a)$ is the transition probability and $R_{s,s'}^a = \mathbb E[r \mid s, a, s']$ is the expected reward for the transition.
So, since moving down from state 11 deterministically enters the terminal state, only the terminal term survives. The reward on this transition is $-1$ like on every other (the expected reward function above is $-1$ for all $s$, $a$, $s'$), and $v_\pi(\text{terminal}) = 0$:

$$
\begin{aligned}
q_\pi(11, \text{down}) &= P_{11,7}^{\text{down}} \Bigl[ R_{11,7}^{\text{down}} + \gamma\, v_\pi(7) \Bigr] + P_{11,10}^{\text{down}} \Bigl[ R_{11,10}^{\text{down}} + \gamma\, v_\pi(10) \Bigr] \\
&\quad + P_{11,11}^{\text{down}} \Bigl[ R_{11,11}^{\text{down}} + \gamma\, v_\pi(11) \Bigr] + P_{11,\text{terminal}}^{\text{down}} \Bigl[ R_{11,\text{terminal}}^{\text{down}} + \gamma\, v_\pi(\text{terminal}) \Bigr] \\
&= 0 + 0 + 0 + 1 \cdot \bigl[ -1 + \gamma \cdot 0 \bigr] \\
&= -1
\end{aligned}
$$
Similarly, moving down from state 7 deterministically reaches state 11, whose converged value in Figure 4.1 is $v_\pi(11) = -14$:

$$
\begin{aligned}
q_\pi(7, \text{down}) &= P_{7,3}^{\text{down}} \Bigl[ R_{7,3}^{\text{down}} + \gamma\, v_\pi(3) \Bigr] + P_{7,6}^{\text{down}} \Bigl[ R_{7,6}^{\text{down}} + \gamma\, v_\pi(6) \Bigr] \\
&\quad + P_{7,11}^{\text{down}} \Bigl[ R_{7,11}^{\text{down}} + \gamma\, v_\pi(11) \Bigr] + P_{7,7}^{\text{down}} \Bigl[ R_{7,7}^{\text{down}} + \gamma\, v_\pi(7) \Bigr] \\
&= 0 + 0 + 1 \cdot \bigl[ -1 + \gamma \cdot (-14) \bigr] + 0 \\
&= -1 - 14\gamma \\
&= -15 \qquad (\gamma = 1 \text{ for this undiscounted task})
\end{aligned}
$$
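As a numeric check of both results, one can plug the converged $v_\pi$ into the formula from Exercise 3.13. The sketch below assumes the `step` and `policy_evaluation` helpers from the earlier snippet.

```python
gamma = 1.0                      # undiscounted task
v = policy_evaluation(gamma=gamma)

def q(s, a):
    """q_pi(s, a) = sum over s' of p(s'|s,a) * (r + gamma * v_pi(s')).
    Transitions are deterministic and every transition yields r = -1."""
    return -1.0 + gamma * v[step(s, a)]

print(round(q(11, "down"), 2))   # -1.0:  down from 11 enters the terminal state
print(round(q(7, "down"), 2))    # -15.0: -1 + gamma * v_pi(11) = -1 - 14
```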