Chapter 6: Temporal-Difference Learning
Jake Gunther
2019/12/13
Temporal-Difference Learning
Background
- Combines MC and DP ideas
- Learns from experience like MC
- Bootstraps like DP
MC Prediction
- Constant-$\alpha$ MC, $V(S_t) \leftarrow V(S_t) + \alpha\,[G_t - V(S_t)]$, is suitable for nonstationary environments
- MC updates at the end of the episode, when $G_t$ is known (see the sketch below)
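For reference, here is a minimal Python sketch of constant-$\alpha$ every-visit MC prediction. The `env.reset()`/`env.step()` interface and the `policy(state)` callable are hypothetical, chosen only for illustration.

```python
from collections import defaultdict

def constant_alpha_mc(env, policy, num_episodes, alpha=0.1, gamma=1.0):
    """Constant-alpha every-visit MC prediction (sketch).

    Assumes env.reset() -> state and env.step(a) -> (next_state, reward, done),
    and that policy(state) returns an action.
    """
    V = defaultdict(float)
    for _ in range(num_episodes):
        # MC must wait for the whole episode before updating.
        states, rewards = [], []
        s, done = env.reset(), False
        while not done:
            s_next, r, done = env.step(policy(s))
            states.append(s)
            rewards.append(r)
            s = s_next
        # Compute returns backward and apply the constant-alpha update.
        G = 0.0
        for t in reversed(range(len(states))):
            G = rewards[t] + gamma * G
            V[states[t]] += alpha * (G - V[states[t]])
    return V
```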
TD Prediction
- Uses the idea that $G_t = R_{t+1} + \gamma G_{t+1}$ and $v_\pi(s) = E_\pi[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s]$
- Updates immediately on the transition to $S_{t+1}$, when $R_{t+1}$ is received
- Does not wait until the end of the episode
- Called one-step TD or TD(0): $V(S_t) \leftarrow V(S_t) + \alpha\,[R_{t+1} + \gamma V(S_{t+1}) - V(S_t)]$
TD Prediction Algorithm

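Below is a minimal Python sketch of tabular TD(0) prediction, under the same hypothetical `env`/`policy` interface as the MC sketch above. The only change from MC is the target: a one-step bootstrapped estimate replaces the full return, so no episode buffer is needed.

```python
from collections import defaultdict

def td0_prediction(env, policy, num_episodes, alpha=0.1, gamma=1.0):
    """Tabular TD(0) prediction of v_pi (sketch)."""
    V = defaultdict(float)          # values of unseen/terminal states stay 0
    for _ in range(num_episodes):
        s, done = env.reset(), False
        while not done:
            s_next, r, done = env.step(policy(s))
            target = r if done else r + gamma * V[s_next]
            V[s] += alpha * (target - V[s])   # update on every transition
            s = s_next
    return V
```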
TD(0) Bootstraps
- The TD estimate of $V(S_t)$ depends on the previously estimated value $V(S_{t+1})$
- Also note the relation between the MC target $G_t$ and the TD target $R_{t+1} + \gamma V(S_{t+1})$
- TD combines the sampling of MC with the bootstrapping of DP
TD(0) Backup Diagram

- TD and MC use sample updates
- DP uses expected updates
TD Advantages
- TD bootstraps (like DP; MC doesn't)
- TD does not require a model of the environment's dynamics (DP requires one)
- TD is online and incremental (MC updates only after complete episodes)
- TD applies to continuing tasks (MC is oriented to episodic tasks)
- TD learns from every transition (MC must ignore or discount episodes with exploratory actions)
TD(0) Convergence
For a stationary MDP and a fixed policy $\pi$, TD(0) prediction (learning $v_\pi$):
- Converges in the mean to $v_\pi$ for a sufficiently small constant $\alpha$
- Converges with prob. 1 if $\alpha$ decreases according to the usual stochastic approximation conditions
In practice, TD methods usually converge faster than constant-$\alpha$ MC on stochastic tasks
Example 6.2 Random Walk
![]()
Homework: Do exercises 6.3, 6.4, and 6.6.
Batch Update Algorithm
- Finite amount of experience, say, a fixed batch of episodes
- TD increments are computed for every time step
- The value function estimate is changed only once, by the sum of all the increments
- Repeat: process all of the experience again, and so on
- With sufficiently small $\alpha$, batch TD(0) and batch MC converge deterministically ... to different answers! (see the sketch below)
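The sketch below, under the same assumed transition format as before, shows batch TD(0): all increments for one pass over the stored experience are accumulated and applied together, then the data are processed again. Replacing the TD target with the episode return $G_t$ would give batch MC.

```python
from collections import defaultdict

def batch_td0(episodes, alpha=0.01, gamma=1.0, sweeps=1000):
    """Batch TD(0) (sketch).

    `episodes` is a list of episodes, each a list of
    (state, reward, next_state, done) transitions.
    """
    V = defaultdict(float)
    for _ in range(sweeps):
        increments = defaultdict(float)
        for episode in episodes:
            for (s, r, s_next, done) in episode:
                target = r if done else r + gamma * V[s_next]
                increments[s] += alpha * (target - V[s])
        # Apply the summed increments only once per pass over the data.
        for s, d in increments.items():
            V[s] += d
    return V
```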
Example 6.3 Random Walk
![]()
- MC averages the actual returns and is optimal in the sense of minimizing mean-squared error on the training set
- How, then, does TD perform better?
- TD is optimal in a way that is more relevant to predicting returns
Example 6.4
See the example on pages 127-128.
What answers do TD and MC give?
Optimality
- Batch MC minimizes MSE on training set
- Batch TD(0) computes the value estimate that is exactly correct for the maximum-likelihood model of the Markov process
Certainty-Equivalence Estimate
Given the observed episodes:
- $\hat{p}(s' \mid s)$ = fraction of observed transitions out of $s$ that went to $s'$
- $\hat{r}(s, s')$ = average of the rewards observed on those transitions
Batch TD(0) converges to the certainty-equivalence estimate (CEE): the value function that is exactly correct for this estimated model (a sketch of computing it directly follows)
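For a tabular problem the certainty-equivalence estimate can be computed directly: build the maximum-likelihood model from the observed transitions and solve the resulting linear Bellman equations. The sketch below assumes the same transition-tuple format as the batch example; the function and variable names are illustrative, not from the text.

```python
import numpy as np

def certainty_equivalence_estimate(episodes, states, gamma=1.0):
    """Solve v = r_hat + gamma * P_hat v for the empirical model (sketch).

    `states` lists the nonterminal states; transitions into terminal
    states contribute reward but no estimated future value. Assumes the
    data are episodic (or gamma < 1) so the linear system is invertible.
    """
    idx = {s: i for i, s in enumerate(states)}
    n = len(states)
    counts = np.zeros((n, n))       # transition counts s -> s'
    reward_sum = np.zeros(n)        # summed rewards on transitions out of s
    visits = np.zeros(n)            # number of transitions out of s
    for episode in episodes:
        for (s, r, s_next, done) in episode:
            i = idx[s]
            visits[i] += 1
            reward_sum[i] += r
            if not done:
                counts[i, idx[s_next]] += 1
    safe = np.maximum(visits, 1.0)
    P_hat = counts / safe[:, None]   # estimated transition probabilities
    r_hat = reward_sum / safe        # estimated expected one-step rewards
    V = np.linalg.solve(np.eye(n) - gamma * P_hat, r_hat)
    return dict(zip(states, V))
```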
Clues to Convergence
- Batch TD(0) converges faster than batch MC because it computes the CEE
- Non-batch methods do not achieve the certainty-equivalence or minimum-MSE estimates exactly; they only move in those directions, respectively
- Non-batch TD(0) is faster than constant-$\alpha$ MC because it moves toward a better estimate
Exercise & Homework
Start exercise 6.7 (pages 128-129) in class and finish as homework.
SARSA: On-Policy TD Control
Background
- Control = Learning optimal policy
- GPI = Prediction (estimate $q_\pi$) + Improvement (make $\pi$ greedier w.r.t. $q_\pi$)
- Use TD methods for prediction
- Trade off exploration and exploitation
- On-policy and off-policy
- State value and action value functions
Episode
![]()
- Previously we considered state-to-state transitions and learned the values of states
- Now we consider transitions from state-action pair to state-action pair and learn the values of state-action pairs
SARSA Update
- $Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha\,[R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t)]$
- Update performed immediately after every transition from a nonterminal state $S_t$
- $Q(S_{t+1}, A_{t+1}) = 0$ if $S_{t+1}$ is terminal
- Uses the quintuple $(S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1})$, hence the name SARSA
SARSA Backup Diagram
![]()
On-Policy SARSA Control
- Estimate $q_\pi$ for the current behavior policy $\pi$ (prediction)
- Use an $\epsilon$-greedy policy derived from $Q$ as the behavior policy
- Change $\pi$ toward greediness with respect to $q_\pi$ (control); see the sketch below
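A compact on-policy SARSA control sketch, again under the assumed `env.reset()`/`env.step()` interface and a supplied list of `actions`, with an $\epsilon$-greedy behavior policy derived from the current $Q$.

```python
import random
from collections import defaultdict

def epsilon_greedy(Q, state, actions, epsilon):
    """Pick a random action with prob. epsilon, otherwise a greedy one."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def sarsa(env, actions, num_episodes, alpha=0.5, gamma=1.0, epsilon=0.1):
    """On-policy SARSA control (sketch)."""
    Q = defaultdict(float)          # Q(terminal, .) stays 0
    for _ in range(num_episodes):
        s = env.reset()
        a = epsilon_greedy(Q, s, actions, epsilon)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            a_next = epsilon_greedy(Q, s_next, actions, epsilon)
            target = r if done else r + gamma * Q[(s_next, a_next)]
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s, a = s_next, a_next   # the sampled next action is actually taken
    return Q
```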
Convergence of SARSA Control
SARSA converges with prob. 1 to an optimal policy and action-value function provided:
- The policy converges in the limit to the greedy policy
- For example, use $\epsilon$-greedy or $\epsilon$-soft policies with $\epsilon \to 0$ (e.g., $\epsilon = 1/t$)
- All state-action pairs are visited infinitely often
Exercise 6.8
Do exercise 6.8 (page 129).
Example 6.5 Windy Gridworld
![]()
- Discuss example and performance curve
- How can MC fail for this problem?
- See code here
Homework
Do Exercises 6.9 and 6.10 (pages 130-131).
Q-Learning
Q-learning = Off-policy TD control
- The learned $Q$ directly approximates $q_*$, independent of the behavior policy
- The behavior policy is $\epsilon$-greedy (to maintain exploration)
Convergence
Guaranteed with prob. 1 provided all state-action pairs continue to be visited (and the usual step-size conditions hold)
Q-Learning Algorithm
![]()
Why is this considered off-policy?
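For comparison with SARSA, here is a tabular Q-learning sketch under the same assumed interface, implementing the update $Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha\,[R_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t)]$. The target bootstraps from the greedy action regardless of what the $\epsilon$-greedy behavior policy actually does next.

```python
import random
from collections import defaultdict

def q_learning(env, actions, num_episodes, alpha=0.5, gamma=1.0, epsilon=0.1):
    """Off-policy Q-learning control (sketch)."""
    Q = defaultdict(float)
    for _ in range(num_episodes):
        s, done = env.reset(), False
        while not done:
            # Behavior policy: epsilon-greedy with respect to the current Q.
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda a2: Q[(s, a2)])
            s_next, r, done = env.step(a)
            # Target policy is greedy: bootstrap from max_a Q(s', a).
            best_next = max(Q[(s_next, a2)] for a2 in actions)
            target = r if done else r + gamma * best_next
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s_next
    return Q
```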
Backup Diagrams
![]()
Comparison
Compare the performance of SARSA and Q-learning in example 6.6 (page 132).
Why does SARSA have better rewards?
Why does SARSA choose the safer path?
Expected SARSA
- E-SARSA moves deterministically in the same direction that SARSA moves in expectation
- E-SARSA has lower variance
- Given same amount of experience, E-SARSA should perform better
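Expected SARSA replaces the sampled $Q(S_{t+1}, A_{t+1})$ in the SARSA target with its expectation under the current policy, $\sum_a \pi(a \mid S_{t+1})\,Q(S_{t+1}, a)$. A sketch of just the target computation for an $\epsilon$-greedy policy, using the same assumed $Q$-table and action list as the earlier sketches:

```python
def expected_sarsa_target(Q, s_next, r, done, actions, gamma=1.0, epsilon=0.1):
    """Expected SARSA target for an epsilon-greedy policy (sketch)."""
    if done:
        return r
    greedy = max(actions, key=lambda a: Q[(s_next, a)])
    expected_q = 0.0
    for a in actions:
        # epsilon-greedy probabilities: epsilon/|A| for every action,
        # plus (1 - epsilon) extra mass on the greedy action.
        prob = epsilon / len(actions) + (1.0 - epsilon) * (a == greedy)
        expected_q += prob * Q[(s_next, a)]
    return r + gamma * expected_q
```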
Comparison
![]()
Discussion
- Can do off-policy E-SARSA
- E-SARSA = Q-learning when target policy is greedy
- E-SARSA generalizes Q-learning
Q-Learning is Max. Biased
![]()
Why does Q-learning choose left (wrong choice)?
Maximization Bias
- The max of the estimated values is used as an estimate of the max of the true values
- But $E[\max_a Q(a)] \ge \max_a E[Q(a)]$, so the estimate is biased high (maximization bias; see the small demo below)
- Problem: the same data are used both to determine the maximizing action and to estimate its value
- Solution: split the data and maintain two value functions
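A tiny simulation makes the bias concrete: even when every $Q(a)$ is an unbiased estimate of a true value of zero, the maximum of the estimates is positive on average. The numbers below (10 actions, 5 samples each) are illustrative assumptions.

```python
import random

def max_bias_demo(num_actions=10, samples_per_action=5, trials=10000):
    """Average of max_a Q_hat(a) when every true q(a) = 0 (sketch)."""
    total = 0.0
    for _ in range(trials):
        # Q_hat(a) = sample mean of noisy rewards whose true mean is 0.
        q_hat = [sum(random.gauss(0.0, 1.0) for _ in range(samples_per_action))
                 / samples_per_action for _ in range(num_actions)]
        total += max(q_hat)
    return total / trials   # clearly > 0, although max_a q(a) = 0

print(max_bias_demo())
```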
Double Learning
Use one estimate, $Q_1$, to determine the greedy action and the other, $Q_2$, to provide the value used in the update, $Q_2(\operatorname{argmax}_a Q_1(a))$; swapping the roles of $Q_1$ and $Q_2$ at random eliminates maximization bias.
Double Learning Algorithm
![]()
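A sketch of tabular double Q-learning under the same assumed interface: on each step one of the two tables is chosen at random, the chosen table selects the greedy action, and the other table supplies its value.

```python
import random
from collections import defaultdict

def double_q_learning(env, actions, num_episodes, alpha=0.5, gamma=1.0, epsilon=0.1):
    """Double Q-learning control (sketch); behaves epsilon-greedily w.r.t. Q1 + Q2."""
    Q1, Q2 = defaultdict(float), defaultdict(float)
    for _ in range(num_episodes):
        s, done = env.reset(), False
        while not done:
            # Behavior policy: epsilon-greedy on the sum of the two estimates.
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda a2: Q1[(s, a2)] + Q2[(s, a2)])
            s_next, r, done = env.step(a)
            # Randomly pick which estimate selects the action and which evaluates it.
            A, B = (Q1, Q2) if random.random() < 0.5 else (Q2, Q1)
            if done:
                target = r
            else:
                a_star = max(actions, key=lambda a2: A[(s_next, a2)])  # A selects
                target = r + gamma * B[(s_next, a_star)]               # B evaluates
            A[(s, a)] += alpha * (target - A[(s, a)])
            s = s_next
    return Q1, Q2
```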
Homework
- Add a learning curve for E-SARSA to Fig. 6.5 in example 6.7 (pages 134-135)
- Do exercise 6.13 on page 136
Tic-Tac-Toe
![]()
- Discuss tic-tac-toe in the agent-environment setting
- Discuss state vs. “afterstate”
- Discuss state-action vs. position-move
- Why is an afterstate value function more efficient than a conventional action-value function?