Q: Give an equation for $v_\pi$ in terms of $q_\pi$ and $\pi$.

A: Expand $v_\pi$ from its definition by introducing the action $A_t$ and marginalizing over it as follows,

$$v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s] = \sum_a \pi(a \mid s)\, \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a] = \sum_a \pi(a \mid s)\, q_\pi(s,a).$$

Q: Give an equation for $q_\pi$ in terms of $v_\pi$ and the four-argument $p$.

A: Expand the return, evaluate the terms, and finally apply the Markov property as follows,

$$q_\pi(s,a) = \mathbb{E}_\pi[R_{t+1} + \gamma G_{t+1} \mid S_t = s, A_t = a] = \sum_{s',r} p(s',r \mid s,a)\bigl[r + \gamma\, \mathbb{E}_\pi[G_{t+1} \mid S_{t+1} = s']\bigr] = \sum_{s',r} p(s',r \mid s,a)\bigl[r + \gamma\, v_\pi(s')\bigr],$$

where the last step uses the Markov property to write $\mathbb{E}_\pi[G_{t+1} \mid S_{t+1} = s'] = v_\pi(s')$, even though we have the definition $v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$.
Observation: For a fixed policy $\pi$, the state-value recursion (the Bellman equation for $v_\pi$) is a linear system of equations in the unknowns $v_\pi(s)$.
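Concretely, for the equiprobable random policy $\pi(a \mid s) = 1/4$ used in the gridworld example, substituting the second equation above into the first and collecting the terms that do not involve $v_\pi$ gives

$$v_\pi(s) = \underbrace{\frac{1}{4}\sum_{a}\sum_{s',r} r\, p(s',r \mid s,a)}_{u(s)} \;+\; \gamma \sum_{s'} \underbrace{\frac{1}{4}\sum_{a}\sum_{r} p(s',r \mid s,a)}_{P(s,s')}\, v_\pi(s'),$$

or, in matrix form, $(I - \gamma P)\, v_\pi = u$: a $25 \times 25$ linear system. This $u$ and (scaled by $\gamma$) this $P$ are exactly what the code below builds before solving with the backslash operator.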
Here is some Matlab code that does all of this. Note in particular the way that the four-argument $p$ function is set up.
% State is a pair (i,j) where i,j in {1,2,3,4,5}.
% State pair to index: ind = (i-1)*5 + j.
NS = 5*5;        % Number of states
NR = 4;          % Number of rewards [-1, 0, 5, 10]
NA = 4;          % Number of actions [up, down, right, left]
R = [-1,0,5,10]; % Rewards
gamma = 0.9;     % Discount factor

% Create four-argument p function.
% p(s', r| s, a)
p = zeros(NS,NR,NS,NA); % Default probability is zero. Note: 4D table.

% Set all non-zero probabilities
% Loop over states s
for i = 1:5     % Row in grid
    for j = 1:5 % Column in grid
        s = (i-1)*5 + j; % State index
        % Deal with special states first.
        % "Other actions result in a reward of zero, except those
        % that move the agent out of the special states A and B.
        % From state A (state 2), all four actions yield a reward
        % of +10 and take the agent to A' (state 22). From state B
        % (state 4), all actions yield a reward of +5 and take the
        % agent to B' (state 14)."
        % State A
        if (i==1 && j==2)
            r = 4;  % Reward of +10
            ip = 5; % Next state
            jp = j; % Next state
            sp = (ip-1)*5 + jp; % Next state index
            for a = 1:4
                p(sp,r,s,a) = 1;
            end
            continue;
        end
        % State B
        if (i==1 && j==4)
            r = 3;  % Reward of +5
            ip = 3; % Next state
            jp = j; % Next state
            sp = (ip-1)*5 + jp; % Next state index
            for a = 1:4
                p(sp,r,s,a) = 1;
            end
            continue;
        end
        % Deal with all the other states.
        % Actions that would take the agent off the grid leave its
        % location unchanged, but also result in a reward of -1.
        % Up action
        a = 1;
        if (i==1)
            ip = i; % Next state (stay in same state)
            jp = j; % Next state (stay in same state)
            r = 1;  % Reward of -1
        else
            ip = i-1; % Next state
            jp = j;   % Next state
            r = 2;    % Reward of zero
        end
        sp = (ip-1)*5 + jp; % Next state index
        p(sp,r,s,a) = 1;
        % Down action
        a = 2;
        if (i==5)
            ip = i; % Next state (stay in same state)
            jp = j; % Next state (stay in same state)
            r = 1;  % Reward of -1
        else
            ip = i+1; % Next state
            jp = j;   % Next state
            r = 2;    % Reward of zero
        end
        sp = (ip-1)*5 + jp; % Next state index
        p(sp,r,s,a) = 1;
        % Right action
        a = 3;
        if (j==5)
            ip = i; % Next state (stay in same state)
            jp = j; % Next state (stay in same state)
            r = 1;  % Reward of -1
        else
            ip = i;   % Next state
            jp = j+1; % Next state
            r = 2;    % Reward of zero
        end
        sp = (ip-1)*5 + jp; % Next state index
        p(sp,r,s,a) = 1;
        % Left action
        a = 4;
        if (j==1)
            ip = i; % Next state (stay in same state)
            jp = j; % Next state (stay in same state)
            r = 1;  % Reward of -1
        else
            ip = i;   % Next state
            jp = j-1; % Next state
            r = 2;    % Reward of zero
        end
        sp = (ip-1)*5 + jp; % Next state index
        p(sp,r,s,a) = 1;
    end
end
% Build vector:
% u(s) = (1/4) sum_{s',r,a} r p(s',r|s,a)
u = zeros(25,1);
for i = 1:5
    for j = 1:5
        s = (i-1)*5 + j;
        for ip = 1:5
            for jp = 1:5
                sp = (ip-1)*5 + jp;
                for a = 1:4
                    for r = 1:4
                        u(s) = u(s) + R(r) * p(sp,r,s,a);
                    end
                end
            end
        end
    end
end
u = u * 0.25;
% Build matrix:
% M(s,s') = (gamma/4) sum_{r,a} p(s',r|s,a)
M = zeros(25,25);
for i = 1:5
    for j = 1:5
        s = (i-1)*5 + j;
        for ip = 1:5
            for jp = 1:5
                sp = (ip-1)*5 + jp;
                for a = 1:4
                    for r = 1:4
                        M(s,sp) = M(s,sp) + p(sp,r,s,a);
                    end
                end
            end
        end
    end
end
M = M * (0.25*gamma);
% Set up final coefficient matrix, I - gamma*P
M = eye(25,25) - M;
% Solve system
v = M\u;
% Reshape to grid
v = reshape(v,5,5).';
disp(v);
Solving the linear system with coefficient matrix $I - \gamma P$ and right-hand side $u$ gives the state-value function $v_\pi$ shown below.
|   | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| 1 | 3.3090 | 8.7893 | 4.4276 | 5.3224 | 1.4922 |
| 2 | 1.5216 | 2.9923 | 2.2501 | 1.9076 | 0.5474 |
| 3 | 0.0508 | 0.7382 | 0.6731 | 0.3582 | -0.4031 |
| 4 | -0.9736 | -0.4355 | -0.3549 | -0.5856 | -1.1831 |
| 5 | -1.8577 | -1.3452 | -1.2293 | -1.4229 | -1.9752 |
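As a quick consistency check (not part of the original code), one can plug the solved values back into the two relations from the beginning: recover $q_\pi$ from $v_\pi$ and the four-argument $p$, then average over the four equiprobable actions and confirm the result reproduces $v_\pi$. The variable names `vvec`, `qpi`, and `vcheck` below are illustrative.

```matlab
% Consistency check (not in the original): recover q_pi from the solved values
% and the four-argument p, then verify v(s) = sum_a pi(a|s) q_pi(s,a) with
% pi(a|s) = 1/4. Assumes p, R, gamma, and the 5x5 grid v from above.
vvec = reshape(v.', 25, 1);    % back to the s = (i-1)*5 + j state ordering
qpi = zeros(25,4);             % q_pi(s,a)
for s = 1:25
    for a = 1:4
        for sp = 1:25
            for r = 1:4
                qpi(s,a) = qpi(s,a) + p(sp,r,s,a) * (R(r) + gamma*vvec(sp));
            end
        end
    end
end
vcheck = mean(qpi, 2);         % average over the four equiprobable actions
disp(max(abs(vcheck - vvec))); % should be zero up to round-off
```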
As the numbers of states, actions, and rewards grow, this direct method of solving the linear system becomes intractable.
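When the direct solve is too expensive, the standard alternative is iterative policy evaluation: use the Bellman equation as an update rule and sweep the states until the values converge. The original stops at the direct solution; the following is only a minimal sketch of that alternative, reusing the `p` table, `R`, and `gamma` built above.

```matlab
% Minimal sketch (not in the original): iterative policy evaluation for the
% same gridworld, reusing p, R, and gamma from above. Each sweep applies the
% Bellman update for the equiprobable policy to every state.
v_it = zeros(25,1);
for sweep = 1:1000
    v_new = zeros(25,1);
    for s = 1:25
        for a = 1:4
            for sp = 1:25
                for r = 1:4
                    v_new(s) = v_new(s) + 0.25 * p(sp,r,s,a) * (R(r) + gamma*v_it(sp));
                end
            end
        end
    end
    delta = max(abs(v_new - v_it));
    v_it = v_new;
    if delta < 1e-10
        break;                 % values have stopped changing
    end
end
disp(reshape(v_it,5,5).');     % should agree with the table above
```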
Q: Why is the value for state $(1,2)$ (state A) less than its immediate reward of $+10$, while the value for state $(1,4)$ (state B) is greater than its immediate reward of $+5$?