Chapter 4 Notes

Gridworld (Example 3.5) Solved by Policy Evaluation (Prediction) for $v_\pi$

Repeat the gridworld example, but use policy evaluation (prediction) for $v_\pi$ instead of solving Bellman's equations as a linear system of equations. (See original code here.)
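A minimal sketch of what such an in-place iterative evaluation might look like in Python, assuming the equiprobable random policy, $\gamma = 0.9$, and the special A/B transitions of Example 3.5 (the names and structure here are illustrative, not the original code's):

    import numpy as np

    # Hypothetical sketch of iterative policy evaluation for the 5x5 gridworld of
    # Example 3.5: equiprobable random policy, gamma = 0.9, special states A and B.
    GAMMA = 0.9
    SIZE = 5
    ACTIONS = [(-1, 0), (1, 0), (0, 1), (0, -1)]   # north, south, east, west
    A, A_PRIME = (0, 1), (4, 1)                    # A -> A' with reward +10
    B, B_PRIME = (0, 3), (2, 3)                    # B -> B' with reward +5

    def step(state, action):
        """Deterministic gridworld dynamics: return (next_state, reward)."""
        if state == A:
            return A_PRIME, 10.0
        if state == B:
            return B_PRIME, 5.0
        r, c = state[0] + action[0], state[1] + action[1]
        if 0 <= r < SIZE and 0 <= c < SIZE:
            return (r, c), 0.0
        return state, -1.0                         # off the grid: stay put, reward -1

    def policy_evaluation(theta=1e-6):
        """Sweep the Bellman expectation backup in place until v_pi stops changing."""
        v = np.zeros((SIZE, SIZE))
        while True:
            delta = 0.0
            for r in range(SIZE):
                for c in range(SIZE):
                    new_v = 0.0
                    for a in ACTIONS:              # pi(a|s) = 1/4 for every action
                        next_state, reward = step((r, c), a)
                        new_v += 0.25 * (reward + GAMMA * v[next_state])
                    delta = max(delta, abs(new_v - v[r, c]))
                    v[r, c] = new_v
            if delta < theta:
                return v

    print(np.round(policy_evaluation(), 1))        # top row should match the linear-system
                                                   # solution: 3.3, 8.8, 4.4, 5.3, 1.5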

Here is the output:

This gives the same result for $v_\pi$ as solving Bellman's equations as a linear system.

Gridworld (Example 3.5) Solved by Policy Evaluation (Prediction) for $q_\pi$

Repeat the gridworld example, but use policy evaluation (prediction) for $q_\pi$.
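Continuing the sketch above: because the gridworld dynamics are deterministic, one way to get $q_\pi$ is a one-step lookahead from the converged $v_\pi$, $q_\pi(s,a) = r + \gamma\, v_\pi(s')$. A possible sketch (again, names are mine, not the original code's):

    # Continuing the sketch above: recover q_pi from the converged v_pi with a
    # one-step lookahead, q_pi(s, a) = r + gamma * v_pi(s').
    def action_values(v):
        """Return a dict mapping (state, action) -> q_pi(s, a)."""
        q = {}
        for r in range(SIZE):
            for c in range(SIZE):
                for a in ACTIONS:
                    next_state, reward = step((r, c), a)
                    q[((r, c), a)] = reward + GAMMA * v[next_state]
        return q

    v = policy_evaluation()
    q = action_values(v)
    print(round(q[((0, 0), (0, 1))], 4))   # east from the top-left state: ~7.9108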

Here is the output:

Here is the solution, $q_\pi(s,a)$ for each state $s$ (numbered row by row from the top-left) and each action:

     s     north     south      east      west
     1    1.9786    1.3699    7.9108    1.9785
     2    8.7897    8.7897    8.7897    8.7897
     3    2.9853    2.0255    4.7905    7.9107
     4    5.3227    5.3227    5.3227    5.3227
     5    0.3434    0.4930    0.3433    4.7905
     6    2.9785    0.0462    2.6935    0.3698
     7    7.9107    0.6648    2.0255    1.3698
     8    3.9852    0.6062    1.7172    2.6934
     9    4.7905    0.3227    0.4930    2.0255
    10    1.3433   -0.3625   -0.5070    1.7171
    11    1.3698   -0.8758    0.6648   -0.9539
    12    2.6934   -0.3916    0.6062    0.0461
    13    2.0255   -0.3190    0.3227    0.6647
    14    1.7171   -0.5267   -0.3625    0.6061
    15    0.4929   -1.0645   -1.3625    0.3227
    16    0.0461   -1.6715   -0.3916   -1.8759
    17    0.6647   -1.2103   -0.3190   -0.8759
    18    0.6061   -1.1060   -0.5267   -0.3916
    19    0.3227   -1.2803   -1.0645   -0.3191
    20   -0.3626   -1.7774   -2.0645   -0.5268
    21   -0.8759   -2.6716   -1.2103   -2.6716
    22   -0.3916   -2.2104   -1.1060   -1.6716
    23   -0.3191   -2.1060   -1.2803   -1.2104
    24   -0.5268   -2.2803   -1.7774   -1.1061
    25   -1.0645   -2.7774   -2.7774   -1.2804

Does the policy derived from $q_\pi$ agree with the policy derived from $v_\pi$?

Which is simpler to use for policy determination: $v_\pi$ or $q_\pi$?

Do the solutions agree mathematically? Do some spot checking using hand calculations.
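For instance, one hand check (assuming states are numbered row by row from the top-left and $\gamma = 0.9$ as in Example 3.5): moving east from state 1 earns reward 0 and lands in state 2, so

\[ q_\pi(1, \text{east}) = 0 + \gamma\, v_\pi(2) = 0.9 \times 8.7897 \approx 7.911, \]

where $v_\pi(2) = 8.7897$ because all four action values of state 2 (the A state) are equal. This matches the table entry $q_\pi(1, \text{east}) = 7.9108$.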

Policy Improvement

Background

$v_\pi(s)$ = expected return starting in state $s$ and following policy $\pi$

$q_\pi(s,a)$ = expected return taking action $a$ from state $s$ and following policy $\pi$ thereafter

$v_\pi(s) = \sum_a \pi(a \mid s)\, q_\pi(s,a)$ ($v_\pi$ is the average/expectation of $q_\pi$ over the actions; see the numeric check below)

$q_\pi$ lets us examine the value of taking a specific action out of a state.
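As a quick numeric check of this relation against the table above (under the equiprobable random policy each action has weight 1/4):

\[ v_\pi(1) = \tfrac{1}{4}\,(1.9786 + 1.3699 + 7.9108 + 1.9785) \approx 3.31, \]

which, assuming state 1 is the top-left cell, matches the value of about 3.3 obtained when Bellman's equations are solved as a linear system.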

Goal

Policy improvement: Adapt policy to put more probability on more valuable actions

Deterministic Policies

Policy $\pi$ is a conditional probability distribution $\pi(a \mid s)$.

A stochastic policy places non-zero probability mass on multiple actions.

A deterministic policy places all probability mass on one action.

A decision function maps each state to a single action, $s \mapsto \pi(s)$.

For a deterministic policy, $\pi(s)$ is the action chosen by the policy with probability one: $\pi(a \mid s) = 1$ if $a = \pi(s)$, and $0$ otherwise.

For a deterministic policy, we have the simple state-value/action-value relation $v_\pi(s) = q_\pi(s, \pi(s))$.

For some other deterministic policy $\pi'$, $q_\pi(s, \pi'(s))$ is the value of following deterministic policy $\pi'$ right now in state $s$ and thereafter following deterministic policy $\pi$.

Policy Improvement

Policy $\pi'$ is better than policy $\pi$ if $v_{\pi'}(s) \ge v_\pi(s)$ for all $s$.

For deterministic policies, we need $q_\pi(s, \pi'(s)) \ge v_\pi(s)$ for all $s$.

Policy Improvement Theorem (Deterministic Policies)

Given two deterministic policies $\pi$ and $\pi'$ that choose different actions in one state $s$, suppose that

\[ q_\pi(s, \pi'(s)) \ge v_\pi(s). \]

This states that it is better to follow policy $\pi'$ in state $s$ and thereafter follow policy $\pi$. By repeatedly applying this decision, we can show that it is always better to follow policy $\pi'$ than $\pi$ in the sense that $v_{\pi'}(s) \ge v_\pi(s)$ for all $s$.

This is proved by repeatedly applying the superiority of $\pi'$ over $\pi$ in state $s$ and the expansion formula

\[ q_\pi(s, a) = \mathbb{E}\big[\, R_{t+1} + \gamma\, v_\pi(S_{t+1}) \mid S_t = s,\; A_t = a \,\big], \]

which is true for all policies (stochastic or deterministic).
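A sketch of the resulting chain of inequalities (the standard textbook argument, written for the deterministic case):

\[
\begin{aligned}
v_\pi(s) &\le q_\pi(s, \pi'(s)) \\
         &= \mathbb{E}\big[\, R_{t+1} + \gamma\, v_\pi(S_{t+1}) \mid S_t = s,\; A_t = \pi'(s) \,\big] \\
         &\le \mathbb{E}\big[\, R_{t+1} + \gamma\, q_\pi(S_{t+1}, \pi'(S_{t+1})) \mid S_t = s,\; A_t = \pi'(s) \,\big] \\
         &= \mathbb{E}_{\pi'}\big[\, R_{t+1} + \gamma R_{t+2} + \gamma^2 v_\pi(S_{t+2}) \mid S_t = s \,\big] \\
         &\;\;\vdots \\
         &\le \mathbb{E}_{\pi'}\big[\, R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots \mid S_t = s \,\big] = v_{\pi'}(s).
\end{aligned}
\]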

We have shown that $\pi'$ is better than $\pi$.

Policy improvement step: Define a new greedy policy $\pi'$ for which $\pi'(s) = \arg\max_a q_\pi(s, a)$ for all states $s$.
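A minimal sketch of this greedy step, reusing SIZE, ACTIONS, and the q dictionary built by action_values in the earlier sketch (all of which are my illustrative names):

    # Hypothetical greedy improvement step: pi'(s) = argmax_a q_pi(s, a).
    def greedy_policy(q):
        """Map each state to the action with the largest action value."""
        policy = {}
        for r in range(SIZE):
            for c in range(SIZE):
                s = (r, c)
                policy[s] = max(ACTIONS, key=lambda a: q[(s, a)])
        return policy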

Suppose the greedy policy $\pi'$ is as good as, but not better than, $\pi$. Then $v_{\pi'} = v_\pi$, this value function satisfies the Bellman optimality equation, and both policies are optimal ($v_{\pi'} = v_\pi = v_*$).

Homework: Prove the policy improvement theorem in the stochastic case.

 

Policy Iteration

Obtain a sequence of monotonically improving policies and value functions by iterating policy evaluation (prediction) and policy improvement (greedy action selection).

Must converge in a finite number of iterations, since a finite MDP has only finitely many deterministic policies and each improvement step yields a strictly better policy until the optimum is reached.
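A sketch of the loop, continuing the gridworld code above (step, SIZE, ACTIONS, GAMMA, and numpy are reused; all names are illustrative, not the book's):

    # Sketch of policy iteration: alternate evaluation of the current deterministic
    # policy with greedy improvement until no action changes.
    def evaluate(policy, theta=1e-6):
        """Iterative policy evaluation for a deterministic policy dict."""
        v = np.zeros((SIZE, SIZE))
        while True:
            delta = 0.0
            for r in range(SIZE):
                for c in range(SIZE):
                    next_state, reward = step((r, c), policy[(r, c)])
                    new_v = reward + GAMMA * v[next_state]
                    delta = max(delta, abs(new_v - v[r, c]))
                    v[r, c] = new_v
            if delta < theta:
                return v

    def policy_iteration():
        # start from an arbitrary deterministic policy (always go north)
        policy = {(r, c): ACTIONS[0] for r in range(SIZE) for c in range(SIZE)}
        while True:
            v = evaluate(policy)
            stable = True
            for r in range(SIZE):
                for c in range(SIZE):
                    s = (r, c)
                    best = max(ACTIONS,
                               key=lambda a: step(s, a)[1] + GAMMA * v[step(s, a)[0]])
                    if best != policy[s]:
                        policy[s], stable = best, False
            if stable:
                return policy, v

The outer loop terminates when the greedy step leaves every action unchanged (the policy-stable test), at which point the policy is greedy with respect to its own value function.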

Homework: Modify the code given (or write code from scratch) to implement policy iteration for the gridworld of Example 3.5 and compare your results to those in Example 3.8 on page 65.