Repeat the gridworld example, but use policy evaluation (prediction) for $v_\pi$ instead of solving Bellman's equations as a linear system of equations. (See original code here.)
```matlab
% ... reuse code that sets up p(s',r|s,a) from Chapter 3
% Inplace policy evaluation (prediction) for v_pi
num_iter = 100;  % Maximum number of iterations
v = zeros(25,1); % Initial value function
for iter = 1:num_iter
  vold = v;      % Save to test for convergence
  for i=1:5
    for j=1:5
      s = (i-1)*5 + j;
      vs = 0;    % Accumulator for inplace computation
      for a=1:4
        for ip=1:5
          for jp=1:5
            sp = (ip-1)*5 + jp;
            for r=1:4
              % Compute expectation suggested by Bellman's equation
              vs = vs + 0.25 * p(sp,r,s,a) * (R(r) + gamma*v(sp));
            end
          end
        end
      end
      v(s) = vs;             % Inplace assignment
    end
  end
  err = norm(v-vold,1);      % Compute size of update
  fprintf('iter = %3d, |v-vold| = %10.8f\n',iter,err);
  if(err < 0.01)             % Test for termination
    break;
  end
end
```
Here is the output:
```
iter = 1, |v-vold| = 27.62459555
iter = 2, |v-vold| = 10.16786506
iter = 3, |v-vold| = 5.71330933
iter = 4, |v-vold| = 3.29504090
iter = 5, |v-vold| = 1.96247912
iter = 6, |v-vold| = 1.22910789
iter = 7, |v-vold| = 0.83593787
iter = 8, |v-vold| = 0.65833058
iter = 9, |v-vold| = 0.55351745
iter = 10, |v-vold| = 0.48344337
iter = 11, |v-vold| = 0.43265023
iter = 12, |v-vold| = 0.37922873
iter = 13, |v-vold| = 0.32782581
iter = 14, |v-vold| = 0.28069078
iter = 15, |v-vold| = 0.23871155
iter = 16, |v-vold| = 0.20202362
iter = 17, |v-vold| = 0.17036710
iter = 18, |v-vold| = 0.14329432
iter = 19, |v-vold| = 0.12028819
iter = 20, |v-vold| = 0.10082764
iter = 21, |v-vold| = 0.08442181
iter = 22, |v-vold| = 0.07062582
iter = 23, |v-vold| = 0.05904626
iter = 24, |v-vold| = 0.04934076
iter = 25, |v-vold| = 0.04121476
iter = 26, |v-vold| = 0.03441677
iter = 27, |v-vold| = 0.02873334
iter = 28, |v-vold| = 0.02398405
iter = 29, |v-vold| = 0.02001685
iter = 30, |v-vold| = 0.01670394
iter = 31, |v-vold| = 0.01393805
iter = 32, |v-vold| = 0.01162928
iter = 33, |v-vold| = 0.00970236
```
This gives the same result for $v_\pi$ as solving Bellman's equations as a linear system.
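To see the agreement numerically, here is a minimal sketch of the linear-system solution for comparison, assuming the Chapter 3 setup (`p`, `R`, `gamma`) and the `v` computed above are still in the workspace; `P_pi`, `r_pi`, and `v_lin` are names introduced here for illustration:

```matlab
% Build the state-transition matrix and expected one-step reward under the
% equiprobable policy pi(a|s) = 0.25
P_pi = zeros(25,25);
r_pi = zeros(25,1);
for s = 1:25
  for a = 1:4
    for sp = 1:25
      for r = 1:4
        P_pi(s,sp) = P_pi(s,sp) + 0.25*p(sp,r,s,a);
        r_pi(s)    = r_pi(s)    + 0.25*p(sp,r,s,a)*R(r);
      end
    end
  end
end
% Solve the Bellman equation v = r_pi + gamma*P_pi*v as a linear system
v_lin = (eye(25) - gamma*P_pi) \ r_pi;
% Compare with the iterative (inplace) solution computed above
fprintf('max |v - v_lin| = %8.6f\n', max(abs(v - v_lin)));
```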
Repeat the gridworld example, but use policy evaluation (prediction) for $q_\pi$.
```matlab
% ... reuse code that sets up p(s',r|s,a) from Chapter 3
% Inplace policy evaluation (prediction) for q_pi
num_iter = 100;  % Maximum number of iterations
q = zeros(25,4); % Initial value function
for iter = 1:num_iter
  qold = q;      % Save to test for convergence
  for i=1:5
    for j=1:5
      s = (i-1)*5 + j;
      for a=1:4
        qsa = 0; % Accumulator for inplace computation
        for ip=1:5
          for jp=1:5
            sp = (ip-1)*5 + jp;
            for r=1:4
              % Compute expectation suggested by Bellman's equation
              qsa = qsa + p(sp,r,s,a)*(R(r) + gamma*0.25*sum(q(sp,:)));
            end
          end
        end
        q(s,a) = qsa;              % Inplace assignment
      end
    end
  end
  err = sum(sum(abs(q-qold)));     % Compute size of update
  fprintf('iter = %3d, |q-qold| = %10.8f\n',iter,err);
  if(err < 0.01)                   % Test for termination
    break;
  end
end
```
Here is the output:
```
iter = 1, |q-qold| = 123.36041637
iter = 2, |q-qold| = 53.58055428
iter = 3, |q-qold| = 26.93173329
iter = 4, |q-qold| = 14.48878524
iter = 5, |q-qold| = 7.97903765
iter = 6, |q-qold| = 4.81052880
iter = 7, |q-qold| = 3.27817039
iter = 8, |q-qold| = 2.64359683
iter = 9, |q-qold| = 2.34263687
iter = 10, |q-qold| = 2.05074371
iter = 11, |q-qold| = 1.74947287
iter = 12, |q-qold| = 1.46812616
iter = 13, |q-qold| = 1.21869089
iter = 14, |q-qold| = 1.00418227
iter = 15, |q-qold| = 0.82320700
iter = 16, |q-qold| = 0.67242938
iter = 17, |q-qold| = 0.54787302
iter = 18, |q-qold| = 0.44557900
iter = 19, |q-qold| = 0.36191219
iter = 20, |q-qold| = 0.29367901
iter = 21, |q-qold| = 0.23814762
iter = 22, |q-qold| = 0.19302073
iter = 23, |q-qold| = 0.15638828
iter = 24, |q-qold| = 0.12667448
iter = 25, |q-qold| = 0.10258628
iter = 26, |q-qold| = 0.08306671
iter = 27, |q-qold| = 0.06725406
iter = 28, |q-qold| = 0.05444723
iter = 29, |q-qold| = 0.04407654
iter = 30, |q-qold| = 0.03567962
iter = 31, |q-qold| = 0.02888143
iter = 32, |q-qold| = 0.02337794
iter = 33, |q-qold| = 0.01892282
iter = 34, |q-qold| = 0.01531649
iter = 35, |q-qold| = 0.01239732
iter = 36, |q-qold| = 0.01003444
iter = 37, |q-qold| = 0.00812186
```
Here is the solution for $q_\pi(s,a)$, with one row per state $s$ and one column per action $a$.
s | $q_\pi(s,1)$ | $q_\pi(s,2)$ | $q_\pi(s,3)$ | $q_\pi(s,4)$ |
---|---|---|---|---|
1 | 1.9786 | 1.3699 | 7.9108 | 1.9785 |
2 | 8.7897 | 8.7897 | 8.7897 | 8.7897 |
3 | 2.9853 | 2.0255 | 4.7905 | 7.9107 |
4 | 5.3227 | 5.3227 | 5.3227 | 5.3227 |
5 | 0.3434 | 0.4930 | 0.3433 | 4.7905 |
6 | 2.9785 | 0.0462 | 2.6935 | 0.3698 |
7 | 7.9107 | 0.6648 | 2.0255 | 1.3698 |
8 | 3.9852 | 0.6062 | 1.7172 | 2.6934 |
9 | 4.7905 | 0.3227 | 0.4930 | 2.0255 |
10 | 1.3433 | -0.3625 | -0.5070 | 1.7171 |
11 | 1.3698 | -0.8758 | 0.6648 | -0.9539 |
12 | 2.6934 | -0.3916 | 0.6062 | 0.0461 |
13 | 2.0255 | -0.3190 | 0.3227 | 0.6647 |
14 | 1.7171 | -0.5267 | -0.3625 | 0.6061 |
15 | 0.4929 | -1.0645 | -1.3625 | 0.3227 |
16 | 0.0461 | -1.6715 | -0.3916 | -1.8759 |
17 | 0.6647 | -1.2103 | -0.3190 | -0.8759 |
18 | 0.6061 | -1.1060 | -0.5267 | -0.3916 |
19 | 0.3227 | -1.2803 | -1.0645 | -0.3191 |
20 | -0.3626 | -1.7774 | -2.0645 | -0.5268 |
21 | -0.8759 | -2.6716 | -1.2103 | -2.6716 |
22 | -0.3916 | -2.2104 | -1.1060 | -1.6716 |
23 | -0.3191 | -2.1060 | -1.2803 | -1.2104 |
24 | -0.5268 | -2.2803 | -1.7774 | -1.1061 |
25 | -1.0645 | -2.7774 | -2.7774 | -1.2804 |
Does the policy derived from $q_\pi$ agree with the policy derived from $v_\pi$?
Which is simpler to use for policy determination: $v_\pi$ or $q_\pi$?
Do the solutions agree mathematically? Do some spot checking using hand calculations.
$v_\pi(s)$ = expected return starting in state $s$ and following policy $\pi$
$q_\pi(s,a)$ = expected return taking action $a$ from state $s$ and following policy $\pi$ thereafter
$v_\pi(s) = \sum_a \pi(a \mid s)\, q_\pi(s,a)$ ($v_\pi$ is the average/expectation of $q_\pi$ over the actions chosen by $\pi$)
$q_\pi$ lets us examine the value of taking a specific action out of a state
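As a quick numerical spot check of the relation $v_\pi(s) = \sum_a \pi(a \mid s)\, q_\pi(s,a)$, here is a minimal sketch, assuming the `v` and `q` arrays computed by the two code blocks above are still in the workspace (`v_from_q` is a name introduced here):

```matlab
% For the equiprobable policy, v_pi(s) = sum_a 0.25*q_pi(s,a), i.e. the
% row-wise average of q
v_from_q = mean(q, 2);
fprintf('max |v - v_from_q| = %8.6f\n', max(abs(v - v_from_q)));

% Hand check for s = 2 (all four actions have the same value there, so the
% average is just that value): v(2) should be close to 8.7897 from the table,
% up to the iteration stopping tolerances.
disp([v(2), q(2,:)]);
```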
Policy improvement: Adapt the policy to put more probability on more valuable actions.
Policy $\pi$ is a conditional probability density function $\pi(a \mid s)$.
A stochastic policy places non-zero probability mass on multiple actions.
A deterministic policy places all probability mass on one action.
The decision function $\pi(s)$ is defined as the action selected by the policy in state $s$.
For a deterministic policy, $\pi(s)$ is the action chosen by the policy with probability one, $\pi(a \mid s) = 1$ for $a = \pi(s)$ and $0$ otherwise.
For a deterministic policy, we have the simple state-value/action-value relation $v_\pi(s) = q_\pi(s, \pi(s))$.
For some other deterministic policy $\pi'$, $q_\pi(s, \pi'(s))$ is the value of following deterministic policy $\pi'$ right now in state $s$ and hereafter following deterministic policy $\pi$.
Policy $\pi'$ is better than policy $\pi$ if $v_{\pi'}(s) \ge v_\pi(s)$ for all $s$.
For deterministic policies, we need $q_\pi(s, \pi'(s)) \ge v_\pi(s)$ for all $s$.
Given two deterministic policies $\pi$ and $\pi'$ that choose different actions in one state $s$, suppose that $q_\pi(s, \pi'(s)) \ge v_\pi(s)$.
This states that it is better to follow policy $\pi'$ in state $s$ and thereafter follow policy $\pi$. By repeatedly applying this decision, we can show that it is always better to follow policy $\pi'$ than $\pi$, in the sense that $v_{\pi'}(s) \ge v_\pi(s)$ for all $s$.
This is proved by repeatedly applying the superiority of $\pi'$ over $\pi$ in state $s$ and the expansion formula
$$q_\pi(s, a) = \mathbb{E}\big[ R_{t+1} + \gamma\, v_\pi(S_{t+1}) \mid S_t = s, A_t = a \big],$$
which is true for all policies (stochastic or deterministic).
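A sketch of how the repeated application plays out, following the standard argument in Sutton and Barto, with expectations taken over trajectories generated by following $\pi'$:

$$
\begin{aligned}
v_\pi(s) &\le q_\pi(s, \pi'(s)) \\
 &= \mathbb{E}\big[ R_{t+1} + \gamma\, v_\pi(S_{t+1}) \mid S_t = s, A_t = \pi'(s) \big] \\
 &\le \mathbb{E}_{\pi'}\big[ R_{t+1} + \gamma\, q_\pi(S_{t+1}, \pi'(S_{t+1})) \mid S_t = s \big] \\
 &= \mathbb{E}_{\pi'}\big[ R_{t+1} + \gamma R_{t+2} + \gamma^2\, v_\pi(S_{t+2}) \mid S_t = s \big] \\
 &\;\;\vdots \\
 &\le \mathbb{E}_{\pi'}\big[ R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots \mid S_t = s \big] = v_{\pi'}(s).
\end{aligned}
$$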
We have shown that $\pi'$ is better than $\pi$.
Policy improvement step: Define a new greedy policy $\pi'$ for which $\pi'(s) = \arg\max_a q_\pi(s, a)$ for all states.
Suppose $\pi'$ is as good as, but not better than, $\pi$. Then it is optimal, $v_{\pi'} = v_\pi = v_*$.
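To make the greedy improvement step concrete for this gridworld, here is a minimal sketch that extracts the greedy deterministic policy from the `q` array computed above (`a_greedy` and `qmax` are names introduced here):

```matlab
% Greedy policy improvement: for each state, pick the action with the
% largest action value q(s,a)
[qmax, a_greedy] = max(q, [], 2);   % a_greedy(s) is the new deterministic policy
disp([(1:25)' a_greedy qmax]);      % state, greedy action index, and its value
```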
Homework: Prove the policy improvement theorem in the stochastic case.
Obtain a sequence of monotonically improving policies and value functions by iterating policy evaluation (prediction) and policy improvement (greedy action selection).
Must converge in a finite number of iterations, since a finite MDP has only finitely many deterministic policies.
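One possible structure for this alternation, sketched under the same assumptions as the code above (`p`, `R`, `gamma` from the Chapter 3 setup); `pi_det` is a name introduced here holding the current deterministic policy as an action index per state, and the numerical details are deliberately left rough for the homework below:

```matlab
% Policy iteration sketch: alternate policy evaluation and greedy improvement
pi_det = ones(25,1);                  % Initial deterministic policy: always action 1
for k = 1:100                         % Outer policy-iteration loop
  % --- Policy evaluation for the current deterministic policy ---
  v = zeros(25,1);
  for sweep = 1:1000
    vold = v;
    for s = 1:25
      a = pi_det(s);
      vs = 0;
      for sp = 1:25
        for r = 1:4
          vs = vs + p(sp,r,s,a)*(R(r) + gamma*v(sp));
        end
      end
      v(s) = vs;                      % Inplace update, as in the code above
    end
    if norm(v-vold,1) < 1e-4, break; end
  end
  % --- Greedy policy improvement ---
  pi_old = pi_det;
  for s = 1:25
    qs = zeros(4,1);
    for a = 1:4
      for sp = 1:25
        for r = 1:4
          qs(a) = qs(a) + p(sp,r,s,a)*(R(r) + gamma*v(sp));
        end
      end
    end
    [~, pi_det(s)] = max(qs);
  end
  if all(pi_det == pi_old)            % Policy stable: stop
    fprintf('policy stable after %d policy-iteration steps\n', k);
    break;
  end
end
```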
Homework: Modify the code given above (or write code from scratch) to implement policy iteration for the gridworld Example 3.5 and compare your results to those in Example 3.8 on page 65.