Jake Gunther
2019/12/13
\[ \begin{align} v_\ast(s) &= \max_a E[R_{t+1} + \gamma v_\ast(S_{t+1}) | S_t=s,A_t=a] \\ &= \max_a \sum_{s',r} p(s',r|s,a) [r+\gamma v_\ast(s')] \\ q_\ast(s,a) &= E[R_{t+1}+\gamma \max_{a'} q_\ast(S_{t+1},a') | S_t=s,A_t=a] \\ &= \sum_{s',r} p(s',r|s,a) [r + \gamma \max_{a'}q_\ast(s',a')] \end{align} \]
\[ \begin{align} v_\pi(s) &= \sum_a \pi(a|s) \sum_{s',r} p(s',r|s,a) \left[ r + \gamma v_\pi(s') \right] \\ q_\pi(s,a) &= \sum_{s',r} p(s',r|s,a) \left[ r + \gamma \sum_{a'} \pi(a'|s') q_\pi(s',a') \right] \end{align} \]
Policy evaluation (prediction) is computing \(v_\pi\) for policy \(\pi\)
Bellman’s equations (one linear equation per state, so they could be solved directly as a linear system):
\[ v_\pi(s) = \sum_a \pi(a|s) \sum_{s',r} p(s',r|s,a) [r+\gamma v_\pi(s')] \]
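A minimal NumPy sketch of that direct solution (not part of the original notes): assume a small tabular MDP stored as arrays `P[s, a, s']` for \(p(s'|s,a)\), `R[s, a, s']` for the expected reward of each transition (the sum over \(r\) is folded into `R`), and `pi[s, a]` for \(\pi(a|s)\). Then \(v_\pi\) solves the linear system \(v_\pi = r_\pi + \gamma P_\pi v_\pi\):

```python
import numpy as np

def policy_evaluation_direct(P, R, pi, gamma=0.9):
    """Solve the Bellman expectation equation for v_pi as a linear system.

    P[s, a, s'] : transition probabilities p(s'|s, a)
    R[s, a, s'] : expected reward for the transition (s, a) -> s'
    pi[s, a]    : policy probabilities pi(a|s)
    """
    P_pi = np.einsum('sa,sap->sp', pi, P)        # state-to-state transitions under pi
    r_pi = np.einsum('sa,sap,sap->s', pi, P, R)  # expected one-step reward under pi
    n = P.shape[0]
    return np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)
```

The direct solve costs \(O(|\mathcal{S}|^3)\), which is one reason the iterative update below is preferred for larger state spaces.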
Iterative policy evaluation (expected update):
\[ v_{k+1}(s) = \sum_a \pi(a|s) \sum_{s',r} p(s',r|s,a) [r+\gamma v_k(s')] \]
\(v_k \rightarrow v_\pi\) as \(k \rightarrow \infty\)
\[ q_{k+1}(s,a) = \sum_{s',r} p(s',r|s,a) \left[ r + \gamma \sum_{a'} \pi(a'|s') q_k(s',a') \right] \]
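A sketch of the corresponding iterative sweep for \(v_\pi\), under the same assumed array conventions as above; it repeats the expected update until the largest change across states falls below a threshold:

```python
import numpy as np

def policy_evaluation(P, R, pi, gamma=0.9, theta=1e-8):
    """Iterative policy evaluation: repeat the expected update until convergence."""
    v = np.zeros(P.shape[0])
    while True:
        # q[s, a] = sum_{s'} p(s'|s, a) [r + gamma v_k(s')]
        q = np.einsum('sap,sap->sa', P, R + gamma * v[None, None, :])
        # v_{k+1}(s) = sum_a pi(a|s) q[s, a]
        v_new = np.einsum('sa,sa->s', pi, q)
        if np.max(np.abs(v_new - v)) < theta:
            return v_new
        v = v_new
```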
\[ \pi_0 \overset{\text{E}}{\longrightarrow} v_{\pi_0} \overset{\text{I}}{\longrightarrow} \pi_1 \overset{\text{E}}{\longrightarrow} v_{\pi_1}\overset{\text{I}}{\longrightarrow} \cdots \overset{\text{I}}{\longrightarrow} \pi_\ast \overset{\text{E}}{\longrightarrow} v_\ast \]
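The chain above alternates evaluation (E) of the current policy with greedy improvement (I). A minimal sketch of that loop, reusing the `policy_evaluation` helper and the assumed array conventions from the sketches above:

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9):
    """Alternate policy evaluation (E) and greedy improvement (I) until the policy is stable."""
    n_states, n_actions = P.shape[0], P.shape[1]
    pi = np.full((n_states, n_actions), 1.0 / n_actions)  # start from the uniform random policy
    while True:
        v = policy_evaluation(P, R, pi, gamma)             # E: evaluate the current policy
        # I: act greedily with respect to v_pi
        q = np.einsum('sap,sap->sa', P, R + gamma * v[None, None, :])
        pi_new = np.zeros_like(pi)
        pi_new[np.arange(n_states), np.argmax(q, axis=1)] = 1.0
        if np.array_equal(pi_new, pi):                     # policy stable => optimal
            return pi, v
        pi = pi_new
```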
Value iteration combines one sweep of policy evaluation with policy improvement in a single update:
\[ v_{k+1}(s) = \max_a \sum_{s',r} p(s',r|s,a) [r + \gamma v_k(s')] \]
Comparing the policy evaluation update (top) with the value iteration update (bottom):
\[ \begin{alignat}{2} v_{k+1}(s) &= \sum_a \pi(a|s)& &\sum_{s',r} p(s',r|s,a) [r + \gamma v_k(s')] \\ v_{k+1}(s) &= \max_a & &\sum_{s',r} p(s',r|s,a) [r + \gamma v_k(s')] \end{alignat} \]
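Value iteration folds one sweep of evaluation and the improvement step into the single max-update above. A sketch under the same assumed conventions:

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, theta=1e-8):
    """Value iteration: iterate the max-update, then read off the greedy policy."""
    v = np.zeros(P.shape[0])
    while True:
        # v_{k+1}(s) = max_a sum_{s'} p(s'|s, a) [r + gamma v_k(s')]
        q = np.einsum('sap,sap->sa', P, R + gamma * v[None, None, :])
        v_new = q.max(axis=1)
        if np.max(np.abs(v_new - v)) < theta:
            return np.argmax(q, axis=1), v_new  # greedy policy and (near-)optimal values
        v = v_new
```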