Jake Gunther
2019/12/13
Estimate \(v_\pi(s)\) as the average of the returns (cumulative discounted future reward) observed after first visits to state \(s\)
Q: How would you modify this for the every-visit MC method?
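A minimal Python sketch of first-visit MC prediction under these definitions, assuming each episode is a list of (state, reward) pairs with reward being the reward received after leaving that state; the names `mc_prediction`, `returns_sum`, etc. are illustrative, not from the text.

```python
from collections import defaultdict

def mc_prediction(episodes, gamma=1.0, first_visit=True):
    """Estimate v_pi(s) as the average of observed returns G_t from state s."""
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)

    for episode in episodes:
        # Sweep the episode backward to compute G_t for every time step.
        G = 0.0
        forward_returns = []
        for state, reward in reversed(episode):
            G = reward + gamma * G
            forward_returns.append((state, G))
        forward_returns.reverse()  # forward_returns[t] == (S_t, G_t)

        seen = set()
        for state, G in forward_returns:
            if first_visit and state in seen:
                continue  # every-visit MC: drop this check and average all visits
            seen.add(state)
            returns_sum[state] += G
            returns_count[state] += 1

    # V(s) = average return observed after visits to s
    return {s: returns_sum[s] / returns_count[s] for s in returns_sum}
```

Setting `first_visit=False` answers the question above: every occurrence of \(s\) in an episode contributes its return to the average, not just the first.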
Dynamic Programming
Monte Carlo
DP: Bellman’s equations update estimates \(v_\pi(s)\) based on other estimates \(v_\pi(s')\)
DP: This is bootstrapping
MC: Estimates \(v_\pi(s)\) are independent for each state
MC: Does not bootstrap. It averages.
After each episode, alternate between policy evaluation (updating the value estimates toward the observed returns) and policy improvement (making the policy greedy with respect to the updated estimates)
\[ \begin{align} \text{Pr}\{ & A_t, S_{t+1}, \cdots, S_T |S_t,A_{t:T-1} \sim \pi \} \\ &= \prod_{k=t}^{T-1} \pi(A_k|S_k) p(S_{k+1}|S_k,A_k) \end{align} \]
\[ \begin{align} \rho_{t:T-1} &= \frac{\prod_{k=t}^{T-1} \pi(A_k|S_k) p(S_{k+1}|S_k,A_k)}{\prod_{k=t}^{T-1} b(A_k|S_k) p(S_{k+1}|S_k,A_k)} \\ &= \frac{\prod_{k=t}^{T-1} \pi(A_k|S_k)}{\prod_{k=t}^{T-1} b(A_k|S_k)} \end{align} \]
The dynamics \(p(S_{k+1}|S_k,A_k)\) cancel in the ratio, so we don’t have to know the MDP dynamics \(p(s',r|s,a)\)
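As a sketch of that cancellation in code (assuming the tail of an episode is available as (state, action) pairs and the policies are given as probability functions `pi(a, s)` and `b(a, s)`; these names are illustrative): only the two policies are evaluated, and the transition probabilities never appear.

```python
def importance_ratio(episode_tail, pi, b):
    """Compute rho_{t:T-1} for steps t, ..., T-1 of one episode.

    episode_tail: list of (state, action) pairs S_k, A_k for k = t, ..., T-1.
    pi(a, s), b(a, s): action probabilities under the target and behavior
    policies.  The dynamics p(s'|s,a) cancel, so they are never needed.
    """
    rho = 1.0
    for state, action in episode_tail:
        rho *= pi(action, state) / b(action, state)
    return rho
```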
\[ \begin{align} v_\pi(s) &= E[G_t | S_t=s] \\ V(s) &= \frac{\sum_{t\in\mathcal{T}(s)} \rho_{t:T(t)-1} G_t}{|\mathcal{T}(s)|} \quad (\text{ordinary IS}) \\ &= \frac{\sum_{t\in\mathcal{T}(s)} \rho_{t:T(t)-1} G_t}{\sum_{t\in\mathcal{T}(s)} \rho_{t:T(t)-1}} \quad (\text{weighted IS}) \end{align} \]
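A sketch of the two estimators for a single state, assuming the observed returns and their importance-sampling ratios have been collected as `(rho, G)` pairs (illustrative names; the list plays the role of \(\mathcal{T}(s)\)):

```python
def ordinary_is(weighted_returns):
    """Ordinary IS: sum(rho * G) / number of visits.  Unbiased, but the
    variance can blow up when the ratios are large."""
    if not weighted_returns:
        return 0.0
    return sum(rho * G for rho, G in weighted_returns) / len(weighted_returns)


def weighted_is(weighted_returns):
    """Weighted IS: sum(rho * G) / sum(rho).  Biased, but each estimate is a
    convex combination of observed returns, so the variance stays bounded."""
    total_weight = sum(rho for rho, _ in weighted_returns)
    if total_weight == 0.0:
        return 0.0
    return sum(rho * G for rho, G in weighted_returns) / total_weight
```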
Ordinary IS: unbiased, but the variance can be unbounded
Weighted IS: biased (the bias shrinks to zero as more returns are observed), but the variance is much lower, so it is preferred in practice
Blackjack: off-policy estimation of the value of a single state
Why the “bump” in the weighted IS curve?
\[ \begin{align} q_\pi(s,a) &= E[G_t|S_t=s,A_t=a] \\ Q(s,a) &= ? \end{align} \]
Hints
\[ \begin{align} V_n = \frac{\sum_{k=1}^{n-1} W_kG_k}{\sum_{k=1}^{n-1} W_k} &= \frac{\sum_{k=1}^{n-1} W_kG_k}{C_{n-1}}, \quad W_i = \rho_{t_i:T(t_i)-1} \\ V_{n+1} C_n &= W_nG_n + V_n C_{n-1} \\ V_{n+1} C_n &= W_nG_n + V_n [C_{n}-W_n] \\ V_{n+1} &= V_n + \frac{W_n}{C_n}[G_n-V_n] \end{align} \]
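The last line is the update actually implemented: keep the estimate and the cumulative weight \(C\) per state and nudge the estimate toward each new weighted return. A minimal sketch (illustrative names, state-value version):

```python
from collections import defaultdict

V = defaultdict(float)  # V(s): weighted-IS estimate
C = defaultdict(float)  # C(s): cumulative sum of the weights W seen so far

def weighted_is_update(state, G, W):
    """Apply one incremental weighted-IS update for a (return, weight) pair."""
    if W == 0.0:
        return  # a zero weight leaves the estimate unchanged
    C[state] += W                                # C_n = C_{n-1} + W_n
    V[state] += (W / C[state]) * (G - V[state])  # V_{n+1} = V_n + (W_n/C_n)(G_n - V_n)
```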
\[ Q(S_t,A_t) \leftarrow Q(S_t,A_t) + \alpha [G - Q(S_t,A_t)] \]
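Here \(\alpha\) plays the role of \(W_n/C_n\) from the derivation above. A sketch of how the update is typically applied to \(Q(s,a)\) from episodes generated by the behavior policy, assuming episodes of (state, action, reward) triples and policies given as probability functions (illustrative names, not from the text):

```python
from collections import defaultdict

def off_policy_mc_prediction(episodes, pi, b, gamma=1.0):
    """Estimate Q(s,a) ~ q_pi(s,a) with incremental weighted IS.

    episodes: each episode is a list of (state, action, reward) triples,
    where reward is the reward received after taking (state, action).
    pi(a, s), b(a, s): target and behavior action probabilities.
    """
    Q = defaultdict(float)
    C = defaultdict(float)

    for episode in episodes:
        G = 0.0
        W = 1.0
        # Sweep backward so G and W can be accumulated incrementally.
        for state, action, reward in reversed(episode):
            G = reward + gamma * G
            C[(state, action)] += W
            Q[(state, action)] += (W / C[(state, action)]) * (G - Q[(state, action)])
            # The ratio for Q(S_t, A_t) starts at t+1, so W is updated afterward.
            W *= pi(action, state) / b(action, state)
            if W == 0.0:
                break  # once W hits zero it stays zero for all earlier steps
    return Q
```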
Do exercise 5.12 on pages 111-112.