Chapter 7: \(n\)-step Bootstrapping

Jake Gunther

2020/1/15

\(n\)-step Bootstrapping

Background

  • Unify Monte Carlo (MC) methods and the one-step temporal-difference (TD) methods
  • “… shift from one to the other smoothly as needed …”
  • “… span the spectrum with MC methods at one end and one-step TD methods at the other.”
  • “… free you from the tyranny of the time step.”

MC & TD Review

  • MC: Update \(V(S)\) based on the entire sequence of rewards until the end of the episode (the observed return \(G\))
  • TD: Update \(V(S)\) using \(R + \gamma V(S')\) as a proxy for the return \(G\)
  • What kind of update lies between these extremes?
  • Update \(V(S)\) based on an intermediate number of rewards: more than one, but fewer than all rewards until termination (see the short example below)
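
To make the spectrum concrete, here is a minimal Python sketch that computes the one-step, two-step, and Monte Carlo targets for the same state from a recorded episode. The function name, its interface, and the toy numbers are illustrative assumptions, not from the chapter.

```python
def n_step_target(rewards, values, t, n, gamma):
    """Compute G_{t:t+n}: up to n rewards plus a bootstrapped value.

    rewards[k] holds R_{k+1}; values[j] is the current estimate V(S_j).
    The sum truncates at the episode end T = len(rewards), in which
    case the target equals the full Monte Carlo return G_t.
    """
    T = len(rewards)
    G = 0.0
    for k in range(t, min(t + n, T)):
        G += gamma ** (k - t) * rewards[k]
    if t + n < T:                       # bootstrap only if the episode is not over
        G += gamma ** n * values[t + n]
    return G

# Toy episode (illustrative numbers): 4 steps.
rewards = [1.0, 0.0, 2.0, 1.0]          # R_1, ..., R_4
values  = [0.5, 0.4, 0.6, 0.3, 0.0]     # V(S_0), ..., V(S_4)
gamma, t = 0.9, 0

print(n_step_target(rewards, values, t, 1, gamma))            # one-step TD target
print(n_step_target(rewards, values, t, 2, gamma))            # two-step target
print(n_step_target(rewards, values, t, len(rewards), gamma)) # Monte Carlo return
```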

\(n\)-step TD Prediction

Monte Carlo Target

\[ \begin{align} G_t &= R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots + \gamma^{T-t-1} R_{T} \\ V(S_t) &\leftarrow V(S_t) + \alpha [G_t - V(S_t)] \end{align} \]

One-Step TD Target

\[ \begin{align} G_{t:t+1} &= R_{t+1} + \gamma V(S_{t+1}) \\ V(S_t) &\leftarrow V(S_t) + \alpha [G_{t:t+1} - V(S_t)] \end{align} \]

\(V(S_{t+1})\) used as a surrogate for returns following \(S_{t+1}\)

Two-Step TD Target

\[ \begin{align} G_{t:t+2} &= R_{t+1} + \gamma R_{t+2} + \gamma^2 V(S_{t+2}) \\ V(S_t) &\leftarrow V(S_t) + \alpha [G_{t:t+2} - V(S_t)] \end{align} \]

\(V(S_{t+2})\) used as a surrogate for returns following \(S_{t+2}\)

\(n\)-Step TD Target

\[ \begin{align} G_{t:t+n} &= R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^{n} V(S_{t+n}) \\ V(S_t) &\leftarrow V(S_t) + \alpha [G_{t:t+n} - V(S_t)] \end{align} \]

\(V(S_{t+n})\) used as a surrogate for returns following \(S_{t+n}\)

Algorithm
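
This slide refers to the textbook's pseudocode for \(n\)-step TD prediction. A rough Python transcription is sketched below; the environment interface (`env.reset()`, `env.step(action)` returning `(next_state, reward, done)`) and the `policy(state)` callable are assumptions made for illustration, not part of the original pseudocode.

```python
from collections import defaultdict

def n_step_td_prediction(env, policy, n, alpha, gamma, num_episodes):
    """Estimate v_pi for a fixed policy using n-step TD updates.

    Assumes hashable states, env.reset() -> S_0, and
    env.step(a) -> (next_state, reward, done).
    """
    V = defaultdict(float)
    for _ in range(num_episodes):
        state = env.reset()
        states, rewards = [state], [0.0]   # states[k] = S_k, rewards[k] = R_k (index 0 unused)
        T = float('inf')                   # episode length, unknown until termination
        t = 0
        while True:
            if t < T:                      # still interacting with the environment
                action = policy(states[t])
                next_state, reward, done = env.step(action)
                states.append(next_state)
                rewards.append(reward)
                if done:
                    T = t + 1
            tau = t - n + 1                # time whose estimate is being updated
            if tau >= 0:
                # n-step return G_{tau:tau+n}, truncated at the episode end
                G = sum(gamma ** (k - tau - 1) * rewards[k]
                        for k in range(tau + 1, min(tau + n, T) + 1))
                if tau + n < T:
                    G += gamma ** n * V[states[tau + n]]
                V[states[tau]] += alpha * (G - V[states[tau]])
            if tau == T - 1:               # last state before termination updated
                break
            t += 1
    return V
```

Each update uses exactly the target \(G_{t:t+n}\) defined above, so \(n = 1\) recovers one-step TD and a sufficiently large \(n\) recovers the Monte Carlo update.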

Error Reduction Property

The error reduction property can be used to show that \(n\)-step TD methods converge to the correct predictions \(v_\pi\) under appropriate technical conditions (see the bound below)

One-step TD (\(n = 1\)) and MC (\(n \geq T - t\), i.e., using all remaining rewards) are special cases
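
The bound behind these claims, stated here as recalled from the textbook (the precise time indexing on \(V\) is worth checking there), says that the worst-case error of the expected \(n\)-step return is at most \(\gamma^n\) times the worst-case error of the current value estimates:

\[ \max_s \Bigl| \mathbb{E}_\pi\bigl[G_{t:t+n} \mid S_t = s\bigr] - v_\pi(s) \Bigr| \;\le\; \gamma^n \max_s \bigl| V(s) - v_\pi(s) \bigr| \]

Since \(\gamma^n < 1\) whenever \(\gamma < 1\) and \(n \geq 1\), bootstrapping from \(n\)-step targets repeatedly contracts the value error, which is the basis of the convergence statement above.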

Performance of \(n\)-step TD
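
In the textbook's random-walk experiments for this chapter, intermediate values of \(n\) learn fastest, outperforming both the one-step extreme (\(n = 1\)) and the Monte Carlo extreme (very large \(n\)).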

\(n\)-step SARSA

  • Replace states \(s\) with state-action pairs \((s,a)\)
  • Update the action-value estimates \(Q\) instead of \(V\) (see the target and update below)
  • Select actions with an \(\varepsilon\)-greedy policy with respect to \(Q\)
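
Spelled out, this mirrors the prediction equations above with \(Q\) in place of \(V\) (the form below follows the textbook's \(n\)-step SARSA development):

\[ \begin{align} G_{t:t+n} &= R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^{n} Q(S_{t+n}, A_{t+n}) \\ Q(S_t, A_t) &\leftarrow Q(S_t, A_t) + \alpha [G_{t:t+n} - Q(S_t, A_t)] \end{align} \]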

\(n\)-step SARSA

\(n\)-step SARSA can learn more from a single episode than 1-step SARSA

1-step SARSA strengthens only the last action of the sequence that led to reward; \(n\)-step SARSA strengthens the last \(n\) actions

As a result, more is learned from each episode

\(n\)-step Off-Policy TD

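To learn about a target policy \(\pi\) from data generated by a behavior policy \(b\), the \(n\)-step update is weighted by an importance-sampling ratio over the actions in the return (written here following the textbook's development; the exact indexing is worth double-checking there):

\[ \begin{align} \rho_{t:h} &= \prod_{k=t}^{\min(h,\,T-1)} \frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)} \\ V(S_t) &\leftarrow V(S_t) + \alpha\, \rho_{t:t+n-1} \bigl[ G_{t:t+n} - V(S_t) \bigr] \end{align} \]

For the action-value (\(n\)-step SARSA) version, the ratio starts and ends one step later, \(\rho_{t+1:t+n}\), because the first action \(A_t\) has already been taken and needs no correction.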