Jake Gunther
2019/12/13
“Any problem of learning goal-directed behavior can be reduced to three signals passing back and forth between an agent and its environment.” — Sutton and Barto, *Reinforcement Learning: An Introduction*
\[ \begin{gather} p(s',r | s, a) = \text{Pr}\left\{\begin{array}{c}S_t=s' \\ R_t=r\end{array} \bigg|\begin{array}{c} S_{t-1}=s \\ A_{t-1}=a\end{array}\right\} \\[1em] \sum_{s' \in \mathcal{S}} \sum_{r \in \mathcal{R}} p(s',r|s,a) = 1, \;\;\forall \;\; s\in \mathcal{S}, a\in \mathcal{A}(s) \end{gather} \]
\[ (S_t,A_t) \mapsto S_{t+1} \]
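As a concrete illustration of the four-argument dynamics function, here is a minimal Python sketch of a hypothetical two-state MDP stored as a nested dictionary, together with a check of the normalization condition above. The states, actions, rewards, and probabilities are made up purely for illustration.

```python
# Hypothetical tabular dynamics: P[(s, a)] maps each outcome (s', r) to its
# probability p(s', r | s, a).  All names and numbers are illustrative.
P = {
    ("s0", "stay"): {("s0", 0.0): 1.0},
    ("s0", "go"):   {("s1", 1.0): 0.8, ("s0", 0.0): 0.2},
    ("s1", "stay"): {("s1", 0.0): 1.0},
    ("s1", "go"):   {("s0", -1.0): 0.9, ("s1", 0.0): 0.1},
}

# Normalization: for every state-action pair, the probabilities over all
# (s', r) outcomes must sum to 1.
for (s, a), outcomes in P.items():
    total = sum(outcomes.values())
    assert abs(total - 1.0) < 1e-12, f"p(.|{s},{a}) sums to {total}"
```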
Do these reward the right thing?
Choose \(A_t\) to maximize discounted return \(G_t = \sum_{k=0}^\infty \gamma^k R_{t+k+1}\)
\(\gamma \in [0,1]\) is the discount rate
Discounting puts diminishing weight on future rewards
\(\gamma=0\) (myopic agent) focuses attention on immediate reward \(R_{t+1}\)
\(\gamma \rightarrow 1\) (farsighted agent) weights future rewards more strongly
\[ \begin{align} G_t &= \sum_{k=0}^\infty \gamma^k R_{t+k+1} = R_{t+1} + \gamma G_{t+1} \quad \forall \quad t<T \\ G_t &= \sum_{k=0}^\infty \gamma^k = \frac{1}{1-\gamma} \quad \text{if } R_t = 1 \;\;\forall\;\; t \;\;\text{and}\;\; \gamma<1 \end{align} \]
\[ G_t = \sum_{k=0}^{T-t-1} \gamma^k R_{t+k+1} = \sum_{k=t+1}^T \gamma^{k-t-1} R_k \]
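To make the recursion \(G_t = R_{t+1} + \gamma G_{t+1}\) concrete, here is a small sketch that computes the discounted returns of a finite episode by working backward from the terminal step; the reward sequence and \(\gamma\) are arbitrary placeholders.

```python
gamma = 0.9
rewards = [1.0, 0.0, 2.0, -1.0, 3.0]       # R_1, R_2, ..., R_T (arbitrary)

# Work backward from the end of the episode using G_t = R_{t+1} + gamma*G_{t+1},
# with G_T = 0.
T = len(rewards)
G = [0.0] * (T + 1)                         # G[t] holds G_t for t = 0, ..., T
for t in reversed(range(T)):
    G[t] = rewards[t] + gamma * G[t + 1]    # rewards[t] is R_{t+1}

print(G[:-1])                               # G_0, ..., G_{T-1}
```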
\[ \begin{gather} \pi(a|s) = \text{Pr}\{A_t = a | S_t=s\} \\ a \in \mathcal{A}(s), \quad s \in \mathcal{S} \end{gather} \]
\[ v_\pi(s) = E_\pi[G_t | S_t=s] = E_\pi \left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \bigg| S_t=s\right] \]
\[ \begin{align} q_\pi(s,a) &= E_\pi[G_t | S_t=s, A_t=a] \\ &= E_\pi\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \bigg| S_t=s, A_t=a\right] \end{align} \]
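Since \(v_\pi\) and \(q_\pi\) are defined as expectations of \(G_t\), one way to make the definitions tangible is a Monte Carlo sketch: roll out many (truncated) episodes under a fixed policy and average the sampled returns. The dynamics and policy below are hypothetical toy data, laid out as in the earlier sketch.

```python
import random

P = {  # same toy layout as above: P[(s, a)][(s', r)] = probability
    ("s0", "stay"): {("s0", 0.0): 1.0},
    ("s0", "go"):   {("s1", 1.0): 0.8, ("s0", 0.0): 0.2},
    ("s1", "stay"): {("s1", 0.0): 1.0},
    ("s1", "go"):   {("s0", -1.0): 0.9, ("s1", 0.0): 0.1},
}
pi = {"s0": {"stay": 0.5, "go": 0.5}, "s1": {"stay": 0.5, "go": 0.5}}
gamma = 0.9

def sample_return(s, horizon=200):
    """One truncated sample of G_t starting from state s, following pi."""
    g, discount = 0.0, 1.0
    for _ in range(horizon):
        a = random.choices(list(pi[s]), weights=list(pi[s].values()))[0]
        outcomes = P[(s, a)]
        (s, r), = random.choices(list(outcomes), weights=list(outcomes.values()))
        g += discount * r
        discount *= gamma
    return g

# Crude Monte Carlo estimate of v_pi(s) = E_pi[G_t | S_t = s].
v_hat = {s: sum(sample_return(s) for _ in range(2000)) / 2000 for s in pi}
print(v_hat)
```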
\[ \begin{align} v_\pi(s) &= \sum_a q_\pi(s,a) \pi (a|s)\\ q_\pi(s,a) &= \sum_{s'} \sum_r \left[ r + \gamma v_\pi(s') \right]p(s',r|s,a) \end{align} \]
\[ v_\pi(s) = \sum_a q_\pi(s,a) \pi (a|s) \]
\[ q_\pi(s,a) = \sum_{s'} \sum_r \left[ r + \gamma v_\pi(s') \right]p(s',r|s,a) \]
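These two identities translate directly into code. Below is a small sketch using the same tabular layout as the earlier examples (`P[(s, a)]` maps `(s', r)` to its probability, `pi[s]` maps actions to probabilities, `v` and `q` are plain dictionaries); all of these names are assumptions made for illustration.

```python
def v_from_q(s, q, pi):
    # v_pi(s) = sum_a pi(a|s) * q_pi(s, a)
    return sum(pi[s][a] * q[(s, a)] for a in pi[s])

def q_from_v(s, a, v, P, gamma):
    # q_pi(s, a) = sum_{s', r} p(s', r | s, a) * [r + gamma * v_pi(s')]
    return sum(prob * (r + gamma * v[s_next])
               for (s_next, r), prob in P[(s, a)].items())
```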
Substitute \(v_\pi\)-\(q_\pi\) relations into one another to obtain recursions.
\[ \begin{align} v_\pi(s) &= \sum_a \pi(a|s) \sum_{s',r} \left[ r + \gamma v_\pi(s') \right]p(s',r|s,a) \\ q_\pi(s,a) &= \sum_{s',r} \left[ r + \gamma \sum_{a'} \pi(a'|s') q_\pi(s',a') \right]p(s',r|s,a) \end{align} \]
\[ v_\pi(s) = \sum_a \pi(a|s) \sum_{s',r} \left[ r + \gamma v_\pi(s') \right]p(s',r|s,a) \]
Graphical representation of \(v_\pi\) recursion from \(s'\) to \(s\)
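One way to see that this recursion pins down \(v_\pi\) is to treat it as an update rule and sweep it until the values stop changing (a sketch of iterative policy evaluation; the toy dynamics, policy, and tolerance below are made-up placeholders).

```python
P = {  # hypothetical dynamics, as in the earlier sketches
    ("s0", "stay"): {("s0", 0.0): 1.0},
    ("s0", "go"):   {("s1", 1.0): 0.8, ("s0", 0.0): 0.2},
    ("s1", "stay"): {("s1", 0.0): 1.0},
    ("s1", "go"):   {("s0", -1.0): 0.9, ("s1", 0.0): 0.1},
}
pi = {"s0": {"stay": 0.5, "go": 0.5}, "s1": {"stay": 0.5, "go": 0.5}}
gamma, tol = 0.9, 1e-10

v = {s: 0.0 for s in pi}
while True:
    delta = 0.0
    for s in v:
        # Bellman equation for v_pi used as an update.
        new_v = sum(pi[s][a] * sum(p * (r + gamma * v[s2])
                                   for (s2, r), p in P[(s, a)].items())
                    for a in pi[s])
        delta = max(delta, abs(new_v - v[s]))
        v[s] = new_v
    if delta < tol:
        break
print(v)
```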
\[ \pi \geq \pi' \quad \Leftrightarrow \quad v_\pi(s) \geq v_{\pi'}(s) \;\;\forall\;\;s \in \mathcal{S} \]
\[ \begin{align} v_\ast(s) &= \max_\pi v_\pi(s) \;\;\forall\;\;s\in\mathcal{S} \\ q_\ast(s,a) &= \max_\pi q_\pi(s,a) \;\;\forall\;\;s\in\mathcal{S} \;\;\text{and}\;\; a\in\mathcal{A}(s) \end{align} \]
\[ \begin{gather} \pi_\ast(a|s) = \mathbb{1}_{a=a_\ast} = \begin{cases} 1, & a=a_\ast \\ 0, & \text{otherwise} \end{cases} \qquad a_\ast \in \arg\max_{a\in\mathcal{A}(s)} q_\ast(s,a) \\[1em] v_\ast(s) = \sum_{a\in\mathcal{A}(s)} \pi_\ast(a|s) q_\ast(s,a) = \max_{a\in\mathcal{A}(s)} q_\ast(s,a) \end{gather} \]
Leveraging this derivation and \(\pi=\pi_\ast\), we have
\[ \begin{align} v_\ast(s) &= \max_a q_\ast(s,a) \\ &= \max_a E[R_{t+1} + \gamma G_{t+1} | S_t=s,A_t=a] \\ &= \max_a E[R_{t+1} + \gamma v_\ast(S_{t+1}) | S_t=s, A_t=a] \\ &= \max_a \sum_{s',r} p(s',r|s,a) [r + \gamma v_\ast(s')] \end{align} \]
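Once \(q_\ast\) is known, the relation \(v_\ast(s) = \max_a q_\ast(s,a)\) means both \(v_\ast\) and a deterministic optimal policy can be read off by greedy maximization over actions. A minimal sketch, assuming a hypothetical action-value table `q_star` (the numbers are arbitrary, not the solution of any particular MDP):

```python
q_star = {("s0", "stay"): 0.0, ("s0", "go"): 1.7,
          ("s1", "stay"): 0.4, ("s1", "go"): 0.9}

states = {s for (s, _) in q_star}
# v_*(s) = max_a q_*(s, a)
v_star = {s: max(q for (s2, _), q in q_star.items() if s2 == s) for s in states}
# pi_*(a|s) puts probability 1 on a greedy action a_* = argmax_a q_*(s, a).
pi_star = {s: max((a for (s2, a) in q_star if s2 == s),
                  key=lambda a: q_star[(s, a)])
           for s in states}
print(v_star, pi_star)
```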
\[ \begin{align} v_\pi(s) &= \sum_a \pi(a|s) \sum_{s',r} \left[ r + \gamma v_\pi(s') \right]p(s',r|s,a) \\ v_\ast(s) &= \max_a \sum_{s',r} \left[ r + \gamma v_\ast(s') \right]p(s',r|s,a) \end{align} \]
\[ \sum_a \pi_\ast(a|s) \quad \overset{\text{greedy}}{\longrightarrow} \quad \max_a \]
Applying the same greedy substitution to the \(q_\pi\) recursion, we have
\[ \begin{align} q_\pi(s,a) &= \sum_{s',r} p(s',r|s,a) \left[ r + \gamma \sum_{a'} \pi(a'|s') q_\pi(s',a') \right] \\ q_\ast(s,a) &= \sum_{s',r} p(s',r|s,a)\left[ r + \gamma \max_{a'} q_\ast(s',a') \right] \end{align} \]
(We don’t yet know how to find \(v_\ast\). Take this as given for now.)
\[ \begin{align} v_\ast(s) &= \max_a \sum_{s',r} \left[ r + \gamma v_\ast(s') \right]p(s',r|s,a) \\ q_\ast(s,a) &= \sum_{s',r} p(s',r|s,a)\left[ r + \gamma \max_{a'} q_\ast(s',a') \right] \end{align} \]
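The notes defer how to actually compute \(v_\ast\); purely as a preview of how these equations get used, the sketch below (value iteration) applies the \(v_\ast\) optimality equation as an update until it converges and then reads off a greedy policy. The MDP, discount, and tolerance are hypothetical placeholders.

```python
P = {  # hypothetical dynamics, as in the earlier sketches
    ("s0", "stay"): {("s0", 0.0): 1.0},
    ("s0", "go"):   {("s1", 1.0): 0.8, ("s0", 0.0): 0.2},
    ("s1", "stay"): {("s1", 0.0): 1.0},
    ("s1", "go"):   {("s0", -1.0): 0.9, ("s1", 0.0): 0.1},
}
gamma, tol = 0.9, 1e-10
states = {s for (s, _) in P}
actions = {s: [a for (s2, a) in P if s2 == s] for s in states}

def backup(s, a, v):
    # sum_{s', r} p(s', r | s, a) * [r + gamma * v(s')]
    return sum(p * (r + gamma * v[s2]) for (s2, r), p in P[(s, a)].items())

v = {s: 0.0 for s in states}
while True:
    delta = 0.0
    for s in states:
        new_v = max(backup(s, a, v) for a in actions[s])  # v_* optimality update
        delta = max(delta, abs(new_v - v[s]))
        v[s] = new_v
    if delta < tol:
        break

# A greedy policy with respect to the converged values.
pi_star = {s: max(actions[s], key=lambda a: backup(s, a, v)) for s in states}
print(v, pi_star)
```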