Jake Gunther
2019/12/13
\[ V(S_t) \leftarrow V(S_t) + \alpha[G_t-V(S_t)] \]
\[ V(S_t) \leftarrow V(S_t) + \alpha[R_{t+1} + \gamma V(S_{t+1}) -V(S_t)] \]
\[ \begin{alignat}{2} v_\pi(s) &= E_\pi[G_t |S_t=s] \quad & &(\text{MC target}) \\ &= E_\pi[R_{t+1}+ \gamma G_{t+1} |S_t=s] \\ &= E_\pi[R_{t+1} + \gamma v_\pi(S_{t+1}) |S_t=s] \quad & &(\text{TD target}) \end{alignat} \]
For a stationary MDP and a fixed policy \(\pi\), TD prediction (learning \(v_\pi\)) converges to \(v_\pi\): in the mean for a sufficiently small constant \(\alpha\), and with probability 1 if \(\alpha\) decreases according to the usual stochastic-approximation conditions
In practice, TD methods usually converge faster than constant-\(\alpha\) MC on stochastic tasks
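As a concrete sketch of the update rules above, assuming a hypothetical environment with `reset()` and `step(action) -> (next_state, reward, done)` and a `policy(state)` function (none of which appear in the text):

```python
from collections import defaultdict

def td0_prediction(env, policy, num_episodes, alpha=0.1, gamma=1.0):
    """Tabular TD(0) prediction of v_pi.

    Unlike constant-alpha MC, which waits for the end of the episode and
    backs up the full return G_t, TD(0) updates V(S_t) after every step
    using the one-step bootstrapped target R_{t+1} + gamma * V(S_{t+1}).
    """
    V = defaultdict(float)                     # V(s) = 0 for unseen states
    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)
            target = r if done else r + gamma * V[s_next]   # TD target
            V[s] += alpha * (target - V[s])                  # TD(0) update
            s = s_next
    return V
```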
Homework: Do exercises 6.3, 6.4, and 6.6.
See the example on pages 127-128.
What answers do TD and MC give?
Given observed episodes:
Batch TD(0) converges to the certainty-equivalence estimate (CEE): the value function that would be exactly correct if the maximum-likelihood model of the Markov process were exactly right.
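A minimal sketch of batch updating, assuming the recorded episodes are stored as lists of `(state, reward, next_state, done)` transitions (a hypothetical format): TD(0) increments are summed over one full pass through the data, applied only at the end of the pass, and the sweeps repeat until the value function stops changing.

```python
from collections import defaultdict

def batch_td0(episodes, alpha=0.01, gamma=1.0, tol=1e-6, max_sweeps=100_000):
    """Batch TD(0) over a fixed set of recorded episodes."""
    V = defaultdict(float)
    for _ in range(max_sweeps):
        delta = defaultdict(float)
        for episode in episodes:
            for s, r, s_next, done in episode:
                target = r if done else r + gamma * V[s_next]
                delta[s] += alpha * (target - V[s])   # accumulate only
        for s, d in delta.items():
            V[s] += d                                  # one batch update per sweep
        if max((abs(d) for d in delta.values()), default=0.0) < tol:
            break
    return V
```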
Start exercise 6.7 (pages 128-129) in class and finish it as homework.
\[ \begin{align} Q(S_t,A_t) \leftarrow& Q(S_t,A_t) \\ &+ \alpha \left[ R_{t+1} + \gamma Q(S_{t+1},A_{t+1}) - Q(S_t,A_t)\right] \end{align} \]
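A minimal tabular SARSA sketch using the same hypothetical `reset()`/`step()` interface as above, plus a finite list of actions and an \(\varepsilon\)-greedy helper (both assumptions for illustration):

```python
import random
from collections import defaultdict

def epsilon_greedy(Q, s, actions, eps):
    """Random action with probability eps, otherwise greedy w.r.t. Q."""
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

def sarsa(env, actions, num_episodes, alpha=0.5, gamma=1.0, eps=0.1):
    """Tabular SARSA: on-policy TD control."""
    Q = defaultdict(float)
    for _ in range(num_episodes):
        s = env.reset()
        a = epsilon_greedy(Q, s, actions, eps)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            a_next = epsilon_greedy(Q, s_next, actions, eps)  # the action actually taken next
            target = r if done else r + gamma * Q[(s_next, a_next)]
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s, a = s_next, a_next
    return Q
```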
SARSA converges with prob. 1 to \(\pi_\ast\) provided all state-action pairs are visited infinitely often and the policy converges in the limit to the greedy policy (e.g., \(\varepsilon\)-greedy with \(\varepsilon = 1/t\))
Do exercise 6.8 (page 129).
Do exercises 6.9 and 6.10 (pages 130-131).
Q-learning = Off-policy TD control
\[ \begin{align} Q(S_t,A_t) \leftarrow& Q(S_t,A_t) \\ &+ \alpha \left[ R_{t+1} + \gamma \max_{a'} Q(S_{t+1},a') - Q(S_t,A_t)\right] \end{align} \]
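The same sketch adapted to Q-learning (same assumed interface): the action actually executed comes from an \(\varepsilon\)-greedy behavior policy, but the update bootstraps on the greedy action.

```python
import random
from collections import defaultdict

def q_learning(env, actions, num_episodes, alpha=0.5, gamma=1.0, eps=0.1):
    """Tabular Q-learning: off-policy TD control."""
    Q = defaultdict(float)
    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            # Behavior policy: eps-greedy with respect to the current Q.
            if random.random() < eps:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda x: Q[(s, x)])
            s_next, r, done = env.step(a)
            # Target policy: greedy. The max is taken whether or not the
            # greedy action is the one that will actually be taken next.
            best_next = 0.0 if done else max(Q[(s_next, x)] for x in actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s_next
    return Q
```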
Convergence to \(q_\ast\) is guaranteed with prob. 1 provided all state-action pairs continue to be visited and updated (and the usual step-size conditions hold)
Why is this considered off-policy?
Compare the performance of SARSA and Q-learning in Example 6.6, cliff walking (page 132).
Why does SARSA have better rewards?
Why does SARSA choose the safer path?
\[ \begin{alignat}{2} Q&(S_t,A_t) \leftarrow Q(S_t,A_t) + \alpha [R_{t+1} +& & \gamma \Phi - Q(S_t, A_t)] \\[1em] \Phi &= Q(S_{t+1},A_{t+1}) & & \text{SARSA}\\[1em] \Phi &= \max_{a'}Q(S_{t+1},a') & & \text{Q-learning}\\[1em] \Phi &= \sum_{a'} \pi(a'|S_{t+1})Q(S_{t+1},a') \;\;& & \text{Expected SARSA} \end{alignat} \]
\[ \begin{alignat}{2} \Phi &= Q(S_{t+1},A_{t+1}) & & \text{SARSA}\\[1em] \Phi &= \sum_{a'} \pi(a'|S_{t+1})Q(S_{t+1},a') \;\;& & \text{Expected SARSA} \end{alignat} \]
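A sketch of how the Expected SARSA target could be computed when \(\pi\) is \(\varepsilon\)-greedy with respect to \(Q\) (the function name and signature are assumptions):

```python
def expected_sarsa_target(Q, s_next, actions, r, done, gamma=1.0, eps=0.1):
    """Return r + gamma * E_pi[Q(s_next, .)] for an eps-greedy policy pi."""
    if done:
        return r
    greedy = max(actions, key=lambda a: Q[(s_next, a)])
    expected_q = 0.0
    for a in actions:
        # eps-greedy: eps/|A| mass on every action, plus (1 - eps) on the greedy one.
        prob = eps / len(actions) + ((1.0 - eps) if a == greedy else 0.0)
        expected_q += prob * Q[(s_next, a)]
    return r + gamma * expected_q
```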
Why does Q-learning choose left (wrong choice)?
\[ \begin{align} Q_1(S_t,A_t)&\leftarrow Q_1(S_t,A_t) + \alpha [ R_{t+1} \\ &+\gamma Q_2(S_{t+1},\arg \max_{a'} Q_1(S_{t+1},a')) \\ &- Q_1(S_t,A_t) ] \end{align} \]
Evaluating the greedy action selected by \(Q_1\) with the independent estimate \(Q_2\) when updating \(Q_1\) (and vice versa) decouples action selection from action evaluation and eliminates the maximization bias.
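A sketch of the full Double Q-learning loop under the same assumed environment interface, flipping a fair coin on each step to decide which table plays the role of \(Q_1\):

```python
import random
from collections import defaultdict

def double_q_learning(env, actions, num_episodes, alpha=0.1, gamma=1.0, eps=0.1):
    """Tabular Double Q-learning with two independent estimates Q1, Q2."""
    Q1, Q2 = defaultdict(float), defaultdict(float)
    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            # Behave eps-greedily with respect to Q1 + Q2.
            if random.random() < eps:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda x: Q1[(s, x)] + Q2[(s, x)])
            s_next, r, done = env.step(a)
            # With probability 0.5 update Q1, using Q2 to evaluate Q1's
            # greedy action; otherwise swap the roles.
            A, B = (Q1, Q2) if random.random() < 0.5 else (Q2, Q1)
            a_star = max(actions, key=lambda x: A[(s_next, x)])
            target = r if done else r + gamma * B[(s_next, a_star)]
            A[(s, a)] += alpha * (target - A[(s, a)])
            s = s_next
    return Q1, Q2
```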