Q: Give an equation for $v_\pi$ in terms of $q_\pi$ and $\pi$.
A: Expand $v_\pi$ from the definition, introduce the action $A_t$, and marginalize over it as follows,

$$v_\pi(s) = \mathbb{E}_\pi\!\left[G_t \mid S_t = s\right] = \sum_{a} \pi(a \mid s)\, \mathbb{E}_\pi\!\left[G_t \mid S_t = s, A_t = a\right] = \sum_{a} \pi(a \mid s)\, q_\pi(s, a).$$
Q: Give an equation for $q_\pi$ in terms of $v_\pi$ and the four-argument $p$.
A: Expand the return, evaluate terms, and finally apply the Markov property as follows,

$$q_\pi(s, a) = \mathbb{E}_\pi\!\left[G_t \mid S_t = s, A_t = a\right] = \mathbb{E}_\pi\!\left[R_{t+1} + \gamma G_{t+1} \mid S_t = s, A_t = a\right] = \sum_{s', r} p(s', r \mid s, a)\Bigl[r + \gamma\, \mathbb{E}_\pi\!\left[G_{t+1} \mid S_{t+1} = s'\right]\Bigr] = \sum_{s', r} p(s', r \mid s, a)\bigl[r + \gamma\, v_\pi(s')\bigr].$$

The last step uses $\mathbb{E}_\pi\!\left[G_{t+1} \mid S_{t+1} = s'\right] = v_\pi(s')$, which holds by the Markov property even though we have the definition $v_\pi(s) \doteq \mathbb{E}_\pi\!\left[G_t \mid S_t = s\right]$, with the return starting at time $t$ rather than $t+1$.
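Substituting the second result into the first gives the Bellman equation for $v_\pi$, the recursion referred to in the observation below:

$$v_\pi(s) = \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\bigl[r + \gamma\, v_\pi(s')\bigr].$$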

Observation: The state-value recursion (Bellman's equation) is a linear system of equations.
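Concretely, for the equiprobable random policy $\pi(a \mid s) = 1/4$ used here, stacking the unknowns $v_\pi(s)$ into a vector gives the matrix form below (a sketch using the same names as the variables `u` and `M` built in the code; note the code folds the factor $\gamma$ into its `M`):

$$u_s = \frac{1}{4}\sum_{a}\sum_{s', r} r\, p(s', r \mid s, a), \qquad M_{s s'} = \frac{1}{4}\sum_{a}\sum_{r} p(s', r \mid s, a),$$

$$v_\pi = u + \gamma M v_\pi \quad\Longleftrightarrow\quad (I - \gamma M)\, v_\pi = u.$$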
Here is some MATLAB code that does all of this. Note in particular the way that the four-argument function $p(s', r \mid s, a)$ is set.
```matlab
% State is a pair (i,j) where i,j in {1,2,3,4,5}.
% State pair to index: ind = (i-1)*5 + j.
NS = 5*5;        % Number of states
NR = 4;          % Number of rewards [-1, 0, 5, 10]
NA = 4;          % Number of actions [up, down, right, left]
R = [-1,0,5,10]; % Rewards
gamma = 0.9;     % Discount factor

% Create four-argument p function.
% p(s',r | s,a)
p = zeros(NS,NR,NS,NA); % Default probability is zero. Note: 4-D table.

% Set all non-zero probabilities
% Loop over states s
for i = 1:5   % Row in grid
  for j = 1:5 % Column in grid
    s = (i-1)*5 + j; % State index

    % Deal with special states first.
    % "Other actions result in a reward of zero, except those that move the
    % agent out of the special states A and B. From state A (state 2), all
    % four actions yield a reward of +10 and take the agent to A' (state 22).
    % From state B (state 4), all actions yield a reward of +5 and take the
    % agent to B' (state 14)."

    % State A
    if (i==1 && j==2)
      r = 4;  % Reward of +10
      ip = 5; % Next state row
      jp = j; % Next state column
      sp = (ip-1)*5 + jp; % Next state index
      for a = 1:4
        p(sp,r,s,a) = 1;
      end
      continue;
    end

    % State B
    if (i==1 && j==4)
      r = 3;  % Reward of +5
      ip = 3; % Next state row
      jp = j; % Next state column
      sp = (ip-1)*5 + jp; % Next state index
      for a = 1:4
        p(sp,r,s,a) = 1;
      end
      continue;
    end

    % Deal with all the other states.
    % Actions that would take the agent off the grid leave its
    % location unchanged, but also result in a reward of -1.

    % Up action
    a = 1;
    if (i==1)
      ip = i; jp = j; % Next state (stay in same state)
      r = 1;          % Reward of -1
    else
      ip = i-1; jp = j; % Next state
      r = 2;            % Reward of zero
    end
    sp = (ip-1)*5 + jp; % Next state index
    p(sp,r,s,a) = 1;

    % Down action
    a = 2;
    if (i==5)
      ip = i; jp = j; % Next state (stay in same state)
      r = 1;          % Reward of -1
    else
      ip = i+1; jp = j; % Next state
      r = 2;            % Reward of zero
    end
    sp = (ip-1)*5 + jp; % Next state index
    p(sp,r,s,a) = 1;

    % Right action
    a = 3;
    if (j==5)
      ip = i; jp = j; % Next state (stay in same state)
      r = 1;          % Reward of -1
    else
      ip = i; jp = j+1; % Next state
      r = 2;            % Reward of zero
    end
    sp = (ip-1)*5 + jp; % Next state index
    p(sp,r,s,a) = 1;

    % Left action
    a = 4;
    if (j==1)
      ip = i; jp = j; % Next state (stay in same state)
      r = 1;          % Reward of -1
    else
      ip = i; jp = j-1; % Next state
      r = 2;            % Reward of zero
    end
    sp = (ip-1)*5 + jp; % Next state index
    p(sp,r,s,a) = 1;
  end
end

% Build vector:
% u(s) = (1/4) * sum_{s',r,a} R(r) * p(s',r|s,a)
u = zeros(25,1);
for i = 1:5
  for j = 1:5
    s = (i-1)*5 + j;
    for ip = 1:5
      for jp = 1:5
        sp = (ip-1)*5 + jp;
        for a = 1:4
          for r = 1:4
            u(s) = u(s) + R(r) * p(sp,r,s,a);
          end
        end
      end
    end
  end
end
u = u * 0.25;

% Build matrix:
% M(s,s') = (1/4) * gamma * sum_{r,a} p(s',r|s,a)
M = zeros(25,25);
for i = 1:5
  for j = 1:5
    s = (i-1)*5 + j;
    for ip = 1:5
      for jp = 1:5
        sp = (ip-1)*5 + jp;
        for a = 1:4
          for r = 1:4
            M(s,sp) = M(s,sp) + p(sp,r,s,a);
          end
        end
      end
    end
  end
end
M = M * (0.25*gamma);

% Set up final coefficient matrix
M = eye(25,25) - M;

% Solve system
v = M\u;

% Reshape to grid (rows = i, columns = j)
v = reshape(v,5,5).';
disp(v);
```

Here are the coefficient matrix $I - \gamma M$ and the vector $u$.
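As an aside, `u` and `M` can also be built without the nested loops. This is a sketch, not part of the original listing; it assumes `p`, `R`, and `gamma` from the code above and a MATLAB version with implicit expansion (R2016b or later):

```matlab
% Vectorized construction of u and M from the 4-D table p(s',r,s,a).
% u(s)    = (1/4) * sum_{s',r,a} R(r) * p(s',r|s,a)
% M(s,s') = (1/4) * gamma * sum_{r,a}  p(s',r|s,a)
u2 = 0.25 * squeeze(sum(sum(sum(p .* R, 1), 2), 4)); % 25x1: expected one-step reward per state
M2 = 0.25 * gamma * squeeze(sum(sum(p, 2), 4)).';    % 25x25: discounted policy transition matrix
v2 = (eye(25) - M2) \ u2;                            % Same solution as v above, before the reshape
```

Here `R` (1x4) broadcasts along the reward dimension of `p`, so the sums over $s'$, $r$, and $a$ reduce to calls to `sum` over the corresponding array dimensions.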

Here is the solution for the state-value function $v_\pi$ (rows are $i$, columns are $j$).
| row $i$ / column $j$ | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| 1 | 3.3090 | 8.7893 | 4.4276 | 5.3224 | 1.4922 |
| 2 | 1.5216 | 2.9923 | 2.2501 | 1.9076 | 0.5474 |
| 3 | 0.0508 | 0.7382 | 0.6731 | 0.3582 | -0.4031 |
| 4 | -0.9736 | -0.4355 | -0.3549 | -0.5856 | -1.1831 |
| 5 | -1.8577 | -1.3452 | -1.2293 | -1.4229 | -1.9752 |
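
As a quick sanity check, the table can be verified against the Bellman equation directly. This is a sketch that assumes `p`, `R`, `gamma`, and the reshaped 5x5 `v` from the code above are still in the workspace:

```matlab
% Verify that v satisfies the Bellman equation under the equiprobable policy:
% v(s) = (1/4) * sum_{a,s',r} p(s',r|s,a) * (R(r) + gamma * v(s'))
maxResidual = 0;
for s = 1:25
  i = floor((s-1)/5) + 1;  j = mod(s-1,5) + 1;        % Grid coordinates of s
  rhs = 0;
  for sp = 1:25
    ip = floor((sp-1)/5) + 1;  jp = mod(sp-1,5) + 1;  % Grid coordinates of s'
    for a = 1:4
      for r = 1:4
        rhs = rhs + 0.25 * p(sp,r,s,a) * (R(r) + gamma * v(ip,jp));
      end
    end
  end
  maxResidual = max(maxResidual, abs(v(i,j) - rhs));
end
fprintf('Max Bellman residual: %g\n', maxResidual);   % Should be near machine precision
```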

As the number of states, actions, and rewards grows, solving the linear system directly in this way becomes intractable.
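In that regime, one standard alternative is iterative policy evaluation, which repeatedly applies the Bellman equation as an update rule and never forms the full coefficient matrix. The following is a minimal sketch, not part of the original code, reusing `p`, `R`, and `gamma` from above:

```matlab
% Iterative policy evaluation for the equiprobable policy: sweep the states,
% updating values in place, until the largest change falls below a threshold.
vIter = zeros(25,1);
theta = 1e-8;                 % Convergence threshold
delta = Inf;
while delta > theta
  delta = 0;
  for s = 1:25
    vNew = 0;
    for sp = 1:25
      for a = 1:4
        for r = 1:4
          vNew = vNew + 0.25 * p(sp,r,s,a) * (R(r) + gamma * vIter(sp));
        end
      end
    end
    delta = max(delta, abs(vNew - vIter(s)));
    vIter(s) = vNew;
  end
end
disp(reshape(vIter,5,5).');   % Converges to the values in the table above
```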
Q: Why is the value of state $(1,2)$ (state A) less than its immediate reward of $+10$, while the value of state $(1,4)$ (state B) is greater than its immediate reward of $+5$?