Q: Give an equation for $v_\pi$ in terms of $q_\pi$ and $\pi$.

A: Expand $v_\pi$ from its definition by introducing the action $A_t$ and marginalizing over it as follows,

$$v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s] = \sum_a \pi(a \mid s)\, \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a] = \sum_a \pi(a \mid s)\, q_\pi(s,a).$$

Q: Give an equation for $q_\pi$ in terms of $v_\pi$ and the four-argument $p$.

A: Expand the return, evaluate the terms, and finally apply the Markov property as follows,

$$q_\pi(s,a) = \mathbb{E}_\pi[R_{t+1} + \gamma G_{t+1} \mid S_t = s, A_t = a] = \sum_{s',r} p(s',r \mid s,a)\bigl[r + \gamma\, \mathbb{E}_\pi[G_{t+1} \mid S_{t+1} = s']\bigr] = \sum_{s',r} p(s',r \mid s,a)\bigl[r + \gamma\, v_\pi(s')\bigr],$$

where the last step uses the Markov property to write $\mathbb{E}_\pi[G_{t+1} \mid S_{t+1} = s'] = v_\pi(s')$, even though we have the definition $v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$.
Observation: For a fixed policy $\pi$, the state-value recursion (the Bellman equation for $v_\pi$) is a linear system of equations in the unknowns $v_\pi(s)$.
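Concretely, for the equiprobable random policy $\pi(a \mid s) = 1/4$ used in the gridworld example, substituting the second equation above into the first and collecting the terms that do not involve $v_\pi$ gives

$$v_\pi(s) = \underbrace{\frac{1}{4}\sum_{a}\sum_{s',r} r\, p(s',r \mid s,a)}_{u(s)} \;+\; \gamma \sum_{s'} \underbrace{\frac{1}{4}\sum_{a}\sum_{r} p(s',r \mid s,a)}_{P(s,s')}\, v_\pi(s'),$$

or, in matrix form, $(I - \gamma P)\, v_\pi = u$: a $25 \times 25$ linear system. This $u$ and (scaled by $\gamma$) this $P$ are exactly what the code below builds before solving with the backslash operator.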
Here is some Matlab code that does all of this. Note in particular the way that the four-argument $p$ function is set up.
% State is a pair (i,j) where i,j in {1,2,3,4,5}.
% State pair to index: ind = (i-1)*5 + j.
NS = 5*5;        % Number of states
NR = 4;          % Number of rewards [-1, 0, 5, 10]
NA = 4;          % Number of actions [up, down, right, left]
R = [-1,0,5,10]; % Rewards
gamma = 0.9;     % Discount factor

% Create four-argument p function.
% p(s', r| s, a)
p = zeros(NS,NR,NS,NA); % Default probability is zero. Note: 4D table.

% Set all non-zero probabilities
% Loop over states s
for i = 1:5     % Row in grid
    for j = 1:5 % Column in grid
        s = (i-1)*5 + j; % State index
        % Deal with special states first.
        % "Other actions result in a reward of zero, except those
        % that move the agent out of the special states A and B.
        % From state A (state 2), all four actions yield a reward
        % of +10 and take the agent to A' (state 22). From state B
        % (state 4), all actions yield a reward of +5 and take the
        % agent to B' (state 14)."
        % State A
        if (i==1 && j==2)
            r = 4;  % Reward of +10
            ip = 5; % Next state
            jp = j; % Next state
            sp = (ip-1)*5 + jp; % Next state index
            for a = 1:4
                p(sp,r,s,a) = 1;
            end
            continue;
        end
        % State B
        if (i==1 && j==4)
            r = 3;  % Reward of +5
            ip = 3; % Next state
            jp = j; % Next state
            sp = (ip-1)*5 + jp; % Next state index
            for a = 1:4
                p(sp,r,s,a) = 1;
            end
            continue;
        end
        % Deal with all the other states.
        % Actions that would take the agent off the grid leave its
        % location unchanged, but also result in a reward of -1.
        % Up action
        a = 1;
        if (i==1)
            ip = i; % Next state (stay in same state)
            jp = j; % Next state (stay in same state)
            r = 1;  % Reward of -1
        else
            ip = i-1; % Next state
            jp = j;   % Next state
            r = 2;    % Reward of zero
        end
        sp = (ip-1)*5 + jp; % Next state index
        p(sp,r,s,a) = 1;
        % Down action
        a = 2;
        if (i==5)
            ip = i; % Next state (stay in same state)
            jp = j; % Next state (stay in same state)
            r = 1;  % Reward of -1
        else
            ip = i+1; % Next state
            jp = j;   % Next state
            r = 2;    % Reward of zero
        end
        sp = (ip-1)*5 + jp; % Next state index
        p(sp,r,s,a) = 1;
        % Right action
        a = 3;
        if (j==5)
            ip = i; % Next state (stay in same state)
            jp = j; % Next state (stay in same state)
            r = 1;  % Reward of -1
        else
            ip = i;   % Next state
            jp = j+1; % Next state
            r = 2;    % Reward of zero
        end
        sp = (ip-1)*5 + jp; % Next state index
        p(sp,r,s,a) = 1;
        % Left action
        a = 4;
        if (j==1)
            ip = i; % Next state (stay in same state)
            jp = j; % Next state (stay in same state)
            r = 1;  % Reward of -1
        else
            ip = i;   % Next state
            jp = j-1; % Next state
            r = 2;    % Reward of zero
        end
        sp = (ip-1)*5 + jp; % Next state index
        p(sp,r,s,a) = 1;
    end
end
% Build vector:
% u(s) = (1/4) sum_{s',r,a} r p(s',r|s,a)
u = zeros(25,1);
for i = 1:5
    for j = 1:5
        s = (i-1)*5 + j;
        for ip = 1:5
            for jp = 1:5
                sp = (ip-1)*5 + jp;
                for a = 1:4
                    for r = 1:4
                        u(s) = u(s) + R(r) * p(sp,r,s,a);
                    end
                end
            end
        end
    end
end
u = u * 0.25;
% Build matrix:
% M(s,s') = (gamma/4) sum_{r,a} p(s',r|s,a)
M = zeros(25,25);
for i = 1:5
    for j = 1:5
        s = (i-1)*5 + j;
        for ip = 1:5
            for jp = 1:5
                sp = (ip-1)*5 + jp;
                for a = 1:4
                    for r = 1:4
                        M(s,sp) = M(s,sp) + p(sp,r,s,a);
                    end
                end
            end
        end
    end
end
M = M * (0.25*gamma);
% Set up final coefficient matrix, I - gamma*P
M = eye(25,25) - M;
% Solve system
v = M\u;
% Reshape to grid
v = reshape(v,5,5).';
disp(v);
Solving the linear system with coefficient matrix $I - \gamma P$ and right-hand side $u$ gives the state-value function $v_\pi$ shown below.
|   | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| 1 | 3.3090 | 8.7893 | 4.4276 | 5.3224 | 1.4922 |
| 2 | 1.5216 | 2.9923 | 2.2501 | 1.9076 | 0.5474 |
| 3 | 0.0508 | 0.7382 | 0.6731 | 0.3582 | -0.4031 |
| 4 | -0.9736 | -0.4355 | -0.3549 | -0.5856 | -1.1831 |
| 5 | -1.8577 | -1.3452 | -1.2293 | -1.4229 | -1.9752 |
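As a quick consistency check (not part of the original code), one can plug the solved values back into the two relations from the beginning: recover $q_\pi$ from $v_\pi$ and the four-argument $p$, then average over the four equiprobable actions and confirm the result reproduces $v_\pi$. The variable names `vvec`, `qpi`, and `vcheck` below are illustrative.

```matlab
% Consistency check (not in the original): recover q_pi from the solved values
% and the four-argument p, then verify v(s) = sum_a pi(a|s) q_pi(s,a) with
% pi(a|s) = 1/4. Assumes p, R, gamma, and the 5x5 grid v from above.
vvec = reshape(v.', 25, 1);    % back to the s = (i-1)*5 + j state ordering
qpi = zeros(25,4);             % q_pi(s,a)
for s = 1:25
    for a = 1:4
        for sp = 1:25
            for r = 1:4
                qpi(s,a) = qpi(s,a) + p(sp,r,s,a) * (R(r) + gamma*vvec(sp));
            end
        end
    end
end
vcheck = mean(qpi, 2);         % average over the four equiprobable actions
disp(max(abs(vcheck - vvec))); % should be zero up to round-off
```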
As the numbers of states, actions, and rewards grow, this direct method of solving the linear system becomes intractable.
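When the direct solve is too expensive, the standard alternative is iterative policy evaluation: use the Bellman equation as an update rule and sweep the states until the values converge. The original stops at the direct solution; the following is only a minimal sketch of that alternative, reusing the `p` table, `R`, and `gamma` built above.

```matlab
% Minimal sketch (not in the original): iterative policy evaluation for the
% same gridworld, reusing p, R, and gamma from above. Each sweep applies the
% Bellman update for the equiprobable policy to every state.
v_it = zeros(25,1);
for sweep = 1:1000
    v_new = zeros(25,1);
    for s = 1:25
        for a = 1:4
            for sp = 1:25
                for r = 1:4
                    v_new(s) = v_new(s) + 0.25 * p(sp,r,s,a) * (R(r) + gamma*v_it(sp));
                end
            end
        end
    end
    delta = max(abs(v_new - v_it));
    v_it = v_new;
    if delta < 1e-10
        break;                 % values have stopped changing
    end
end
disp(reshape(v_it,5,5).');     % should agree with the table above
```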
Q: Why is the value for state $(1,2)$ (state A) less than its immediate reward of $+10$, while the value for state $(1,4)$ (state B) is greater than its immediate reward of $+5$?