Chapter 2 Notes

Numerical results from example runs (10-armed testbed, 1000 steps each)

Run 1

Greedy policy

Arm   Count   Accum. R   Mean R    True q*
1     1000    956.5      0.9565    0.9316
2     0       0.0        0.0000    -0.7628
3     0       0.0        0.0000    -0.2383
4     0       0.0        0.0000    -0.5191
5     0       0.0        0.0000    0.4432
6     0       0.0        0.0000    0.6307
7     0       0.0        0.0000    0.6361
8     0       0.0        0.0000    -0.3312
9     0       0.0        0.0000    0.6138
10    0       0.0        0.0000    0.3032

ε-greedy policy

Arm   Count   Accum. R   Mean R    True q*
1     903     835.3      0.9251    0.9316
2     11      -14.0      -1.2754   -0.7628
3     10      -2.2       -0.2226   -0.2383
4     17      -9.4       -0.5506   -0.5191
5     9       0.6        0.0698    0.4432
6     9       6.4        0.7164    0.6307
7     8       -0.0       -0.0050   0.6361
8     11      -5.8       -0.5279   -0.3312
9     13      3.4        0.2612    0.6138
10    9       -0.8       -0.0861   0.3032

Run 2

Greedy policy

Arm   Count   Accum. R   Mean R    True q*
1     24      -1.0       -0.0400   -0.4828
2     1       -1.5       -1.4800   -0.9959
3     1       -0.1       -0.1149   0.2333
4     974     303.7      0.3118    0.3565
5     0       0.0        0.0000    -0.8226
6     0       0.0        0.0000    0.1425
7     0       0.0        0.0000    1.4684
8     0       0.0        0.0000    -0.4728
9     0       0.0        0.0000    0.5024
10    0       0.0        0.0000    -1.6466

ε-greedy policy

Arm   Count   Accum. R   Mean R    True q*
1     8       -2.3       -0.2921   -0.4828
2     13      -14.7      -1.1314   -0.9959
3     7       -0.3       -0.0441   0.2333
4     9       2.9        0.3270    0.3565
5     11      -8.8       -0.7993   -0.8226
6     15      -0.9       -0.0585   0.1425
7     913     1352.4     1.4812    1.4684
8     10      -0.3       -0.0337   -0.4728
9     8       8.5        1.0590    0.5024
10    6       -5.5       -0.9111   -1.6466

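In each table above, the Mean R column is just Accum. R divided by Count, i.e. the sample-average estimate of that arm's value, while True q* is the arm's actual mean reward. As a rough sketch of the bookkeeping (the variable names here are mine, not necessarily those in the testbed code below, and the reward draw assumes the standard unit-variance Gaussian testbed), these per-arm statistics can be maintained incrementally:

% Per-arm statistics corresponding to the table columns above
nArms = 10;
count = zeros(1, nArms);   % "Count": pulls per arm
accum = zeros(1, nArms);   % "Accum. R": total reward per arm
meanR = zeros(1, nArms);   % "Mean R": sample-average value estimate

a = 1;                     % arm pulled this step (example)
r = 0.9316 + randn();      % example reward; 0.9316 is arm 1's true value in Run 1

count(a) = count(a) + 1;
accum(a) = accum(a) + r;
% Equivalent to meanR(a) = accum(a) / count(a), in incremental form:
meanR(a) = meanR(a) + (r - meanR(a)) / count(a);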

Observation

Based on these example runs we can see why, averaged over many runs, the long-term (1000-step) accumulated reward of the ε-greedy policy exceeds that of the pure greedy policy, even though the opposite can hold in the short term (the first 10-20 steps) or in individual lucky runs. In Run 1 the greedy policy happened to lock onto the best arm immediately (arm 1, true value 0.9316) and came out ahead, since ε-greedy spent 97 of its 1000 pulls on other arms. In Run 2, however, unlucky early rewards led the greedy policy to settle on arm 4 (true value 0.3565) without ever sampling arm 7, the best arm (true value 1.4684); the ε-greedy policy kept exploring, found arm 7, pulled it 913 times, and accumulated more than four times the total reward.
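
To make the two policies concrete, here is a minimal sketch of one testbed run. The names are mine and the setup is assumed, not taken from the code below: true values drawn from a unit normal, unit-variance Gaussian rewards, 1000 steps, and an illustrative epsilon = 0.1 (the ε used for the runs above is not recorded).

% One run of an epsilon-greedy agent on a 10-armed testbed (sketch)
nArms   = 10;
nSteps  = 1000;
epsilon = 0.1;                       % exploration probability (assumed)
qStar   = randn(1, nArms);           % true action values
count   = zeros(1, nArms);
meanR   = zeros(1, nArms);           % sample-average estimates
total   = 0;                         % accumulated reward over the run

for t = 1:nSteps
    if rand() < epsilon
        a = randi(nArms);            % explore: uniformly random arm
    else
        [~, a] = max(meanR);         % exploit: highest current estimate
    end
    r = qStar(a) + randn();          % noisy reward around the true value
    count(a) = count(a) + 1;
    meanR(a) = meanR(a) + (r - meanR(a)) / count(a);
    total = total + r;
end

Setting epsilon = 0 recovers the pure greedy policy: the agent stops sampling any arm whose estimate falls behind, which is exactly the failure seen in Run 2, where the greedy policy never tried arm 7.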


MATLAB code for the testbed