Chapter 4: Dynamic Programming

Jake Gunther

2019/12/13

Background & Review

Comments

  • Must know \(p(s',r|s,a)\) to apply DP
  • DP only applies to finite (and small) MDPs
  • Have to approximate otherwise
  • DP basis for understanding approximation methods
  • DP learns \(v_\ast\) and \(q_\ast\), then distills \(\pi_\ast\)

Optimal Value Function Recursions

\[ \begin{align} v_\ast(s) &= \max_a E[R_{t+1} + \gamma v_\ast(S_{t+1}) | S_t=s,A_t=a] \\ &= \max_a \sum_{s',r} p(s',r|s,a) [r+\gamma v_\ast(s')] \\ q_\ast(s,a) &= E[R_{t+1}+\gamma \max_{a'} q_\ast(S_{t+1},a') | S_t=s,A_t=a] \\ &= \sum_{s',r} p(s',r|s,a) [r + \gamma \max_{a'}q_\ast(s',a')] \end{align} \]
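A minimal sketch of these two optimality backups, assuming a tabular model stored as P[s][a] = [(prob, next_state, reward), ...]; the function names and the tiny two-state model are made up for illustration.

```python
def v_star_backup(P, v, s, gamma=0.9):
    """One Bellman optimality backup for v*(s): max over actions of the expected target."""
    return max(sum(p * (r + gamma * v[s2]) for p, s2, r in P[s][a]) for a in P[s])

def q_star_backup(P, q, s, a, gamma=0.9):
    """One Bellman optimality backup for q*(s,a): expectation, then max over next actions."""
    return sum(p * (r + gamma * max(q[s2].values())) for p, s2, r in P[s][a])

# Tiny two-state model with made-up numbers, just to exercise the backups
P = {0: {0: [(1.0, 0, 0.0)], 1: [(0.5, 0, 1.0), (0.5, 1, 0.0)]},
     1: {0: [(1.0, 1, 2.0)]}}
v = [0.0, 0.0]
q = {0: {0: 0.0, 1: 0.0}, 1: {0: 0.0}}
print(v_star_backup(P, v, s=0), q_star_backup(P, q, s=0, a=1))
```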

Value Function Recursions

\[ \begin{align} v_\pi(s) &= \sum_a \pi(a|s) \sum_{s',r} p(s',r|s,a) \left[ r + \gamma v_\pi(s') \right] \\ q_\pi(s,a) &= \sum_{s',r} p(s',r|s,a) \left[ r + \gamma \sum_{a'} \pi(a'|s') q_\pi(s',a') \right] \end{align} \]
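The corresponding expected backups under a fixed policy, as a minimal sketch; it assumes the same tabular layout P[s][a] = [(prob, next_state, reward), ...] plus pi[s][a] giving \(\pi(a|s)\), and the names are illustrative.

```python
def v_pi_backup(P, pi, v, s, gamma=0.9):
    """v_pi(s): average over actions under pi of the expected one-step target."""
    return sum(pi[s][a] * sum(p * (r + gamma * v[s2]) for p, s2, r in P[s][a])
               for a in P[s])

def q_pi_backup(P, pi, q, s, a, gamma=0.9):
    """q_pi(s,a): expected reward plus the policy-weighted value of the next pair."""
    return sum(p * (r + gamma * sum(pi[s2][a2] * q[s2][a2] for a2 in P[s2]))
               for p, s2, r in P[s][a])
```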

Policy Evaluation (Prediction)

Policy Evaluation (Prediction)

Policy evaluation (prediction) is the problem of computing \(v_\pi\) for a given policy \(\pi\)

Computing \(v_\pi\) for Policy \(\pi\)

Bellman’s equations (could solve as linear system):

\[ v_\pi(s) = \sum_a \pi(a|s) \sum_{s',r} p(s',r|s,a) [r+\gamma v_\pi(s')] \]

Iterative policy evaluation (expected update):

\[ v_{k+1}(s) = \sum_a \pi(a|s) \sum_{s',r} p(s',r|s,a) [r+\gamma v_k(s')] \]

\(v_k \rightarrow v_\pi\) as \(k \rightarrow \infty\)

In-place Computations

  • Need two arrays to update all states in parallel, or
  • Do in-place computations and update sequentially

Algorithm
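Below is a minimal in-place sketch of iterative policy evaluation, assuming the tabular layout P[s][a] = [(prob, next_state, reward), ...] and a stochastic policy pi[s][a] = \(\pi(a|s)\); a two-array variant would write each sweep into a separate copy of v instead of overwriting it. The tiny MDP at the end is made up for illustration.

```python
import numpy as np

def policy_evaluation(P, pi, gamma=0.9, theta=1e-8):
    v = np.zeros(len(P))              # one value per state, updated in place
    while True:
        delta = 0.0
        for s in P:                   # sweep over states, reusing fresh values immediately
            v_new = sum(pi[s][a] * sum(p * (r + gamma * v[s2]) for p, s2, r in P[s][a])
                        for a in P[s])
            delta = max(delta, abs(v_new - v[s]))
            v[s] = v_new
        if delta < theta:             # stop when the largest change in a sweep is tiny
            return v

# Tiny made-up 2-state MDP and a uniform-random policy, just to exercise the code
P = {0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]}, 1: {0: [(1.0, 1, 0.0)]}}
pi = {0: {0: 0.5, 1: 0.5}, 1: {0: 1.0}}
print(policy_evaluation(P, pi))
```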

Examples

Exercise

  • What does iterative policy evaluation (prediction) look like for \(q_\pi\)?
  • Discuss storage requirements compared with \(v_\pi\).

Exercise Answer

\[ q_{k+1}(s,a) = \sum_{s',r} p(s',r|s,a) \left[ r + \gamma \sum_{a'} \pi(a'|s') q_k(s',a') \right] \]

  • Storing \(v_\pi(s)\) requires \(|\mathcal{S}|\) locations
  • Storing \(q_\pi(s,a)\) requires \(|\mathcal{S}| \times |\mathcal{A}|\) locations (see the sketch below)
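A minimal sketch of this \(q\)-version under the same assumed model/policy layout; note that the table is now \(|\mathcal{S}| \times |\mathcal{A}|\) rather than \(|\mathcal{S}|\).

```python
import numpy as np

def q_policy_evaluation(P, pi, n_actions, gamma=0.9, theta=1e-8):
    q = np.zeros((len(P), n_actions))     # |S| x |A| table instead of |S|
    while True:
        delta = 0.0
        for s in P:
            for a in P[s]:
                q_new = sum(p * (r + gamma * sum(pi[s2][a2] * q[s2, a2] for a2 in P[s2]))
                            for p, s2, r in P[s][a])
                delta = max(delta, abs(q_new - q[s, a]))
                q[s, a] = q_new
        if delta < theta:
            return q
```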

Back to Gridworld

  • Let’s go back to the gridworld and solve for \(q_\pi\).

Policy Improvement

Policy Improvement

Go to notes.

Policy Iteration

Policy Iteration

\[ \pi_0 \overset{\text{E}}{\longrightarrow} v_{\pi_0} \overset{\text{I}}{\longrightarrow} \pi_1 \overset{\text{E}}{\longrightarrow} v_{\pi_1}\overset{\text{I}}{\longrightarrow} \cdots \overset{\text{I}}{\longrightarrow} \pi_\ast \overset{\text{E}}{\longrightarrow} v_\ast \]

Policy Iteration Algorithm
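A minimal sketch of policy iteration under the same assumed model layout, with a deterministic policy stored as pi[s] = action; this is an illustration, not the book's exact pseudocode.

```python
import numpy as np

def q_from_v(P, v, s, gamma):
    """Action values at s computed by one-step lookahead from v."""
    return {a: sum(p * (r + gamma * v[s2]) for p, s2, r in P[s][a]) for a in P[s]}

def policy_iteration(P, gamma=0.9, theta=1e-8):
    pi = {s: next(iter(P[s])) for s in P}   # arbitrary initial deterministic policy
    v = np.zeros(len(P))
    while True:
        # Step 1: policy evaluation -- sweep until values are consistent with pi
        while True:
            delta = 0.0
            for s in P:
                v_new = q_from_v(P, v, s, gamma)[pi[s]]
                delta = max(delta, abs(v_new - v[s]))
                v[s] = v_new
            if delta < theta:
                break
        # Step 2: policy improvement -- act greedily with respect to the new values
        stable = True
        for s in P:
            qs = q_from_v(P, v, s, gamma)
            best = max(qs, key=qs.get)
            if best != pi[s]:
                pi[s], stable = best, False
        if stable:
            return pi, v
```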

Value Iteration

Policy Iteration Summary

  • Policy iteration step 1: Policy evaluation (prediction) - iteratively sweep over states
  • Policy iteration step 2: Policy improvement

Value Iteration Summary

  • Value iteration step 1: Policy evaluation (prediction) - do just one sweep
  • Value iteration step 2: Policy improvement

\[ v_{k+1}(s) = \max_a \sum_{s',r} p(s',r|s,a) [r + \gamma v_k(s')] \]

  • Combines improvement and evaluation in one eq.
  • Iterated version of Bellman’s eq. for optimal value \(v_\ast\)

Value Iteration Algorithm
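A minimal sketch of value iteration under the same assumed model layout; each sweep folds evaluation and improvement into the single max update, and the greedy policy is distilled after the values converge.

```python
import numpy as np

def value_iteration(P, gamma=0.9, theta=1e-8):
    v = np.zeros(len(P))
    while True:
        delta = 0.0
        for s in P:               # one sweep: evaluation and improvement folded together
            v_new = max(sum(p * (r + gamma * v[s2]) for p, s2, r in P[s][a]) for a in P[s])
            delta = max(delta, abs(v_new - v[s]))
            v[s] = v_new
        if delta < theta:
            break
    # Distill the greedy policy from the converged values
    pi = {s: max(P[s], key=lambda a: sum(p * (r + gamma * v[s2]) for p, s2, r in P[s][a]))
          for s in P}
    return v, pi
```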

Family of Value Iteration Algs.

\[ \begin{alignat}{2} v_{k+1}(s) &= \sum_a \pi(a|s)& &\sum_{s',r} p(s',r|s,a) [r + \gamma v_k(s')] \\ v_{k+1}(s) &= \max_a & &\sum_{s',r} p(s',r|s,a) [r + \gamma v_k(s')] \end{alignat} \]

  • Mix policy evaluation sweeps (1st eq.) with value iteration sweeps (2nd eq.); one way to do this is sketched below
  • Expectation (1st eq.) and maximization (2nd eq.)
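One way to realize this mixing (essentially modified policy iteration) is sketched below under the same assumed model layout: each round does one maximization sweep, extracts the implied greedy policy, and then runs a few expectation sweeps under it. The sweep counts are arbitrary.

```python
import numpy as np

def one_step_targets(P, v, s, gamma):
    """Expected one-step return for each action at s, computed from v."""
    return {a: sum(p * (r + gamma * v[s2]) for p, s2, r in P[s][a]) for a in P[s]}

def mixed_sweeps(P, n_eval=3, n_rounds=100, gamma=0.9):
    v = np.zeros(len(P))
    for _ in range(n_rounds):
        for s in P:                        # one value-iteration (maximization) sweep
            v[s] = max(one_step_targets(P, v, s, gamma).values())
        pi = {}
        for s in P:                        # greedy policy implied by the current values
            qs = one_step_targets(P, v, s, gamma)
            pi[s] = max(qs, key=qs.get)
        for _ in range(n_eval):            # a few evaluation (expectation) sweeps under pi
            for s in P:
                v[s] = one_step_targets(P, v, s, gamma)[pi[s]]
    return v
```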

Homework

  • Work the Gambler’s Problem in example 4.2 on page 84. Use value iteration.
  • Do exercise 4.9
  • Do exercise 4.10

Asynchronous Dynamic Programming

Asynchronous DP

  • In-place iterative DP algorithms not organized by linear sweeps over state space
  • Update values in any order
  • Update values in some states more often than in others (can’t ignore any state)
  • Update in a stochastic manner (see the sketch after this list)
  • For convergence guarantees, each state must be updated infinitely often
  • Update based on what the agent actually experiences
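A minimal sketch of an asynchronous, in-place variant: states are updated one at a time in a random order rather than in ordered sweeps, and every state keeps getting selected. Same assumed model layout; the update count is arbitrary.

```python
import random
import numpy as np

def async_value_iteration(P, n_updates=10_000, gamma=0.9, seed=0):
    rng = random.Random(seed)
    states = list(P)
    v = np.zeros(len(P))
    for _ in range(n_updates):
        s = rng.choice(states)    # any order works, as long as every state keeps being selected
        v[s] = max(sum(p * (r + gamma * v[s2]) for p, s2, r in P[s][a]) for a in P[s])
    return v
```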

Generalized Policy Iteration

Various Approaches

  • Policy Iteration
    • Make \(v_{k+1}\) consistent with \(\pi_k\) (policy evaluation)
    • Make \(\pi_{k+1}\) greedy wrt \(v_{k+1}\) (policy improvement)
  • Value Iteration
    • Truncate policy evaluation (to one sweep)
    • Mixed numbers of evaluation and improvement steps (full sweeps)
  • Asynchronous DP
    • Partial sweeps
    • Random sweeps

Generalized Policy Iteration

  • Let policy evaluation and policy improvement processes interact
  • “All” RL methods are GPI: have values and policies
    • Policies being improved wrt values
    • Values driving toward agreement with policy
  • Evaluation and improvement processes compete

Generalized Policy Iteration

Solution Methods

Solution Methods

  • Exhaustive search is expensive or intractable
  • Linear programming can also solve MDPs, but it becomes impractical at a much smaller number of states than DP does
  • DP is actually quite efficient (up to millions of states)
  • DP convergence typically better than theoretical worst case
  • Asynchronous DP is preferred when only a few states lie along optimal solution trajectories