Chapter 3: Finite Markov Decision Processes

Jake Gunther

2019/12/13

Markov Decision Processes

Overview of MDPs

  • Evaluative feedback
  • Associative aspects
  • Sequential decision making
  • Tradeoff immediate vs. delayed rewards
  • Bandits: \(q_\ast(a)\)
  • MDPs: \(q_\ast(s,a)\) and \(v_\ast(s)\)

Agent-Environment Interface

Problem Representation

“Any problem of learning goal-directed behavior can be reduced to three signals passing back and forth between an agent and its environment.”

  • Choices (actions of agent)
  • Basis for choices (states of environment)
  • Goal definition (rewards to agent)

Agent-Environment Interface

  • Interaction at discrete time steps, \(t=0, 1, 2, 3, \cdots\)
  • Agent: chooses actions \(A_t \in \mathcal{A}(s)\)
  • Environment: has a state \(S_t\in \mathcal{S}\)
  • Environment: generates rewards \(R_t \in \mathcal{R}\)
  • Trajectory: \(S_0, A_0, R_1, S_1, A_1, R_2, S_2, A_2, R_3, S_3, \cdots\)

Time Steps

  • \(t=0, 1, 2, 3, \cdots\)
  • May refer to fixed intervals of real time
  • May refer to arbitrary successive stages of decision making and acting

Actions

  • \(A_t \in \mathcal{A}(s)\)
  • Any decision we want to learn how to make
  • Low-level controls - voltage on a motor (continuous)
  • High-level decisions - have lunch (binary), go to graduate school (binary)
  • Can be mental or computational (think about \(X\), focus on \(Y\))

States

  • \(S_t\in \mathcal{S}\)
  • Anything we can know that might be useful in decision making
  • Low-level sensations (sensor readings)
  • High-level and abstract (symbolic description of objects in a room)
  • May include memory
  • May be mental (e.g., not knowing where my keys are)
  • May be subjective (surprise in a clearly defined sense)

Rewards

  • \(R_t \in \mathcal{R}\)
  • Formalizes the goal of the agent
  • Part of the environment
  • Computed/derived inside natural or artificial (robotic) agents, but considered external to agent

Agent-Environment Boundary

  • Environment: anything that cannot be changed arbitrarily by agent
  • Boundary: limit of agent’s absolute control (but not of its knowledge)
  • Robots: motors, linkages, sensors \(\in\) environment
  • Person/animal: muscles, skeleton, sensory organs \(\in\) environment

Example: Bioreactor

  • Objective: Determine temperature and stirring rate for bioreactor
  • State: thermocouple, other sensors, ingredients, target chemicals
  • Actions: target temperature and stirring rate (passed to lower level controller)
  • Reward: rate of target chemical production
  • States & actions: vectors
  • Reward: scalar value

Example: Pick-and-Place Robot

  • Objective: Control motion of robotic arm in repetitive task (fast smooth motion)
  • State: Positions and velocities of linkages
  • Action: Motor voltages
  • Reward: +1 (for each pick-and-place success) - jerkiness of motion

Discuss these Examples

  • Stop smoking
  • Play Pac-Man
  • Invest wisely
  • Driving to a destination (move the boundary around)

Look at Example 3.3

  • Action set that depends on the state
  • Markovity of states defined in graph
  • Dynamics are tabulated (finite MDP)

Mathematical Formulation

Finite MDP

  • \(\mathcal{S, A, R}\) are finite sets
  • RVs \(S_t, A_t, R_t\) have PMFs
  • MDP dynamics:

\[ \begin{gather} p(s',r | s, a) = \text{Pr}\left\{\begin{array}{c}S_t=s' \\ R_t=r\end{array} \bigg|\begin{array}{c} S_{t-1}=s \\ A_{t-1}=a\end{array}\right\} \\[1em] \sum_{s' \in \mathcal{S}} \sum_{r \in \mathcal{R}} p(s',r|s,a) = 1, \;\;\forall \;\; s\in \mathcal{S}, a\in \mathcal{A}(s) \end{gather} \]
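As a concrete (made-up) illustration of tabulated dynamics, the sketch below stores \(p(s',r|s,a)\) for a tiny two-state MDP as a Python dictionary and verifies the normalization condition; all state and action names are invented.

```python
# Made-up finite MDP: dynamics stored as p[(s, a)] -> list of (s', r, prob).
p = {
    ("low", "wait"):     [("low", 0.0, 1.0)],
    ("low", "recharge"): [("high", 0.0, 1.0)],
    ("high", "work"):    [("high", 1.0, 0.7), ("low", 1.0, 0.3)],
}

# Each conditional distribution p(s', r | s, a) must sum to one.
for (s, a), outcomes in p.items():
    total = sum(prob for _, _, prob in outcomes)
    assert abs(total - 1.0) < 1e-12, f"p(.|{s},{a}) sums to {total}"
```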

Markov Property

  • Not a restriction on the decision process itself
  • Markovity is a property of the state: the next state depends only on the current state and action

\[ (S_t,A_t) \mapsto S_{t+1} \]

Goals and Rewards

Goals

  • Maximize total amount of reward received
  • Don’t focus on immediate (short run) reward
  • Focus on cumulative reward in the long run

Rewards

  • Incentivize what you want the agent to learn
  • If you want the agent to learn “to do something for us, then provide rewards to it in such a way that in maximizing them the agent will also achieve our goals.”
  • Rewards communicate what to achieve, not how to achieve it

Chess

  • Reward winning (what)
  • Don’t reward subgoals (how)
    • Taking opponent’s pieces (how)
    • Controlling center of board (how)
  • +10 for win (reward winning the game)
  • -1 for each turn (reward winning the game in few moves)

Examples

Do these reward the right thing?

  • Robotic walking - make reward proportional to forward motion on each time step
  • Escape maze - reward is -1 for every time step

Returns and Episodes

Returns

  • Choose \(A_t\) to maximize return \(G_t\)
  • Return \(G_t\) is cumulative reward received after time \(t\)
  • Return \(G_t = R_{t+1} + R_{t+2} + R_{t+3} + \cdots + R_T\)
  • Return could be some other function of reward sequence
  • Note: Action \(A_t\) cannot maximize \(G_{t-k}\) for \(k>0\)

Episodes

  • An episode is one complete repetition of the agent-environment interaction
    • Starting state is known or drawn from distribution
    • Ending in terminal state \(s\in \mathcal{S}^+\) at time \(T<\infty\)
    • \(T\) is an RV
  • Episodic tasks (games, mazes, etc.)
  • Nonepisodic tasks continue without end
    • Process control task
    • Robot with long life span
    • \(T=\infty\)

Discounting

  • Choose \(A_t\) to maximize discounted return \(G_t = \sum_{k=0}^\infty \gamma^k R_{t+k+1}\)

  • \(\gamma \in [0,1]\) is the discount rate

  • Discounting puts diminishing weight on future rewards

  • \(\gamma=0\) (myopic agent) focuses attention on immediate reward \(R_{t+1}\)

  • \(\gamma \rightarrow 1\) (farsighted agent)

Recursion

\[ \begin{align} G_t &= \sum_{k=0}^\infty \gamma^k R_{t+k+1} = R_{t+1} + \gamma G_{t+1} \quad \forall \quad t<T \\ G_t &= \sum_{k=0}^\infty \gamma^k = \frac{1}{1-\gamma} \quad \text{if } R_t = 1 \;\;\forall\;\; t \;\;\text{and}\;\; \gamma < 1 \end{align} \]
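A minimal sketch (not from the text) that computes \(G_t\) for every \(t\) in a finite episode by applying the recursion backward from the terminal time; the reward sequence is arbitrary.

```python
def returns(rewards, gamma):
    """Given [R_1, ..., R_T], return [G_0, ..., G_{T-1}] via G_t = R_{t+1} + gamma * G_{t+1}."""
    G = 0.0                    # G_T = 0 (no rewards after the terminal time)
    out = []
    for r in reversed(rewards):
        G = r + gamma * G      # one step of the recursion
        out.append(G)
    return list(reversed(out))

print(returns([1.0, 1.0, 1.0, 10.0], gamma=0.9))  # [G_0, G_1, G_2, G_3]
```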

Example: Cart-Pole

  • Objective: Balance the pole and stay on track
  • Initialize: Start with pole balanced in track center
  • Episodic: \(R_t = +1\) for every step without failure (\(\gamma=1\))
  • Q: What is \(G_t\)?
  • A: The number of steps until failure
  • This keeps pole balanced as long as possible (\(G_t\rightarrow\infty\))

Example: Cart-Pole

  • Objective: Balance the pole and stay on track
  • Initialize: Start with pole balanced in track center
  • Continuing: \(R_t = -1\) for failure and \(R_t=0\) otherwise (with \(\gamma<1\))
  • Return: \(G_t = -\gamma^K\) where \(K\) is the number of steps to failure
  • This keeps pole balanced as long as possible

Example: Cart-Pole

  • Objective: Balance the pole and stay on track
  • Initialize: Start with pole balanced in track center
  • Episodic: \(R_t = -1\) for failure and \(R_t=0\) otherwise (with \(\gamma<1\))
  • Return: \(G_t = -\gamma^{T-t-1}\)
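As a quick numeric check with made-up numbers: with \(\gamma = 0.9\) and failure at \(T = t + 10\), the episodic formulation gives

\[ G_t = -\gamma^{T-t-1} = -(0.9)^{9} \approx -0.387, \]

while a failure only 3 steps away gives \(G_t = -(0.9)^2 = -0.81\); later failures yield returns closer to zero, so maximizing the return keeps the pole up longer.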

Maze escape robot

  • \(R_t=1\) for escape and \(R_t=0\) otherwise. Why doesn’t this work?
  • How can we communicate through the reward what we want the robot to do?

Unified Notation

Task Types

  • Episodic: \(S_{t,i}, A_{t,i}, R_{t,i}, \pi_{t,i}, T_i\) for the \(i^\text{th}\) episode
  • Convention: Drop the \(i\) most of the time
  • Continuing: \(S_{t}, A_{t}, R_{t}, \pi_{t}, T\)
  • Same notation for both task types

Return

  • Episodic: \(G_t = R_{t+1} + R_{t+2} + \cdots + R_{T}\)
  • Continuing: \(G_t = R_{t+1} + R_{t+2} + R_{t+3} + \cdots\)
  • Want same notation for both task types
  • To unify notation, define absorbing (square) state for episodic task where rewards are zero

Unified Notation

\[ G_t = \sum_{k=0}^{T-t-1} \gamma^k R_{t+k+1} = \sum_{k=t+1}^T \gamma^{k-t-1} R_k \]

  • Continuing: \(T=\infty\) and \(\gamma \in [0,1)\) (need convergence)
  • Episodic: \(T<\infty\) and \(\gamma \in [0,1]\) (convergence guaranteed)
  • Episode numbers not needed

Policies

Policy

\[ \begin{gather} \pi(a|s) = \text{Pr}\{A_t = a | S_t=s\} \\ a \in \mathcal{A}(s), \quad s \in \mathcal{S} \end{gather} \]

  • Policy: How the agent chooses actions in the context of a state
  • RL methods: How \(\pi\) changes based on experience
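A minimal sketch (with invented states, actions, and probabilities) of a tabular policy \(\pi(a|s)\) and of sampling \(A_t \sim \pi(\cdot|S_t)\):

```python
import random

# pi[s][a] = probability of choosing action a in state s (made-up numbers).
pi = {
    "low":  {"wait": 0.2, "recharge": 0.8},
    "high": {"work": 1.0},
}

def sample_action(pi, s):
    """Draw A_t ~ pi(.|s)."""
    actions, probs = zip(*pi[s].items())
    return random.choices(actions, weights=probs, k=1)[0]

print(sample_action(pi, "low"))
```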

Value Functions

State-Value Function for Policy \(\pi\)

\[ v_\pi(s) = E_\pi[G_t | S_t=s] = E_\pi \left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \bigg| S_t=s\right] \]

  • Expected return (value) when starting in state \(s\) and following policy \(\pi\) thereafter

Action-Value for Policy \(\pi\)

\[ \begin{align} q_\pi(s,a) &= E_\pi[G_t | S_t=s, A_t=a] \\ &= E_\pi\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \bigg| S_t=s, A_t=a\right] \end{align} \]

  • Expected return (value) of taking action \(a\) in state \(s\) and following policy \(\pi\) thereafter

Look Ahead (Monte Carlo Methods)

  • Expectations in \(v_\pi\) and \(q_\pi\) can be estimated from sample averages over many episodes
  • How? For each state maintain average return that followed that state under policy \(\pi\). This converges to \(v_\pi\).
  • How? For each state-action pair maintain average return that followed under policy \(\pi\). This converges to \(q_\pi\).
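A rough sketch of the state-value version of this idea. It assumes a hypothetical `run_episode(pi)` that follows policy `pi` and returns the visited `(state, reward)` pairs; everything here is illustrative, not the text's algorithm.

```python
from collections import defaultdict

def mc_state_values(run_episode, pi, gamma, num_episodes):
    """Estimate v_pi(s) as the average of returns observed after each visit to s."""
    totals = defaultdict(float)   # sum of returns observed from each state
    counts = defaultdict(int)     # number of returns observed from each state
    for _ in range(num_episodes):
        episode = run_episode(pi)            # assumed: [(S_0, R_1), (S_1, R_2), ...]
        G = 0.0
        for s, r in reversed(episode):       # accumulate returns backward
            G = r + gamma * G
            totals[s] += G                   # every-visit Monte Carlo update
            counts[s] += 1
    return {s: totals[s] / counts[s] for s in totals}
```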

Look Ahead (Approximation Methods)

  • If too many states, then keeping averages in each state is not practical.
  • Approximate \(v_\pi, q_\pi\) as parameterized functions (e.g. DNN) and adjust parameters to match observed returns.
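A toy sketch of the parameterized alternative: represent \(\hat v(s) \approx \mathbf{w}^\top \mathbf{x}(s)\) with a linear function of features and nudge the weights toward each observed return. The feature vector, step size, and return below are made up.

```python
import numpy as np

def mc_linear_update(w, x, G, alpha=0.01):
    """One Monte Carlo update of a linear value estimate v_hat(s) = w . x(s),
    moving the prediction toward the observed return G."""
    return w + alpha * (G - w @ x) * x

w = np.zeros(4)                          # weights for 4 made-up features
x = np.array([1.0, 0.5, 0.0, -1.0])      # feature vector for some visited state
w = mc_linear_update(w, x, G=2.5)        # observed return G = 2.5 (illustrative)
```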

Value Recursions

Exercises

  • Give an equation for \(v_\pi\) in terms of \(q_\pi\) and \(\pi\) (solution)
  • Give an equation for \(q_\pi\) in terms of \(v_\pi\) and four-argument \(p\) (solution)

Answers

\[ \begin{align} v_\pi(s) &= \sum_a q_\pi(s,a) \pi (a|s)\\ q_\pi(s,a) &= \sum_{s'} \sum_r \left[ r + \gamma v_\pi(s') \right]p(s',r|s,a) \end{align} \]

Backup Diagram for \(v_\pi\)

\[ v_\pi(s) = \sum_a q_\pi(s,a) \pi (a|s) \]

Backup Diagram for \(q_\pi\)

\[ q_\pi(s,a) = \sum_{s'} \sum_r \left[ r + \gamma v_\pi(s') \right]p(s',r|s,a) \]

Value Recursion

Substitute \(v_\pi\)-\(q_\pi\) relations into one another to obtain recursions.

\[ \begin{align} v_\pi(s) &= \sum_a \pi(a|s) \sum_{s',r} \left[ r + \gamma v_\pi(s') \right]p(s',r|s,a) \\ q_\pi(s,a) &= \sum_{s',r} \left[ r + \gamma \sum_{a'} \pi(a'|s') q_\pi(s',a') \right]p(s',r|s,a) \end{align} \]

State-Value Recursion

\[ v_\pi(s) = \sum_a \pi(a|s) \sum_{s',r} \left[ r + \gamma v_\pi(s') \right]p(s',r|s,a) \]

  • Consistency condition between \(v_\pi(s)\) and \(v_\pi(s')\)
  • Expected value of \(r+\gamma v_\pi(s')\) over \(p(s',r,a|s) = \pi(a|s)p(s',r|s,a)\)
  • Bellman equation for \(v_\pi\)
  • \(v_\pi\) is unique solution to Bellman equation
  • Bellman equation: Compute, approximate, learn \(v_\pi\)
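One common way to compute \(v_\pi\) from the Bellman equation is to iterate it as an update until the values stop changing (iterative policy evaluation). The sketch below reuses the dynamics format `p[(s, a)] -> [(s', r, prob), ...]` and policy table `pi[s][a]` from the earlier sketches and is illustrative only.

```python
def policy_evaluation(states, actions, p, pi, gamma, tol=1e-8):
    """Iterate the Bellman equation for v_pi to convergence.
    `actions(s)` gives the available actions; terminal states have none."""
    v = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            # Expected backup: sum over a, s', r of pi(a|s) p(s',r|s,a) [r + gamma v(s')]
            new_v = sum(
                pi[s].get(a, 0.0) * prob * (r + gamma * v[s2])
                for a in actions(s)
                for (s2, r, prob) in p[(s, a)]
            )
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < tol:
            return v
```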

Backup Diagram for \(v_\pi\)

Graphical representation of \(v_\pi\) recursion from \(s'\) to \(s\)

  • Start state \(s\) at top (open circle)
  • Policy \(\pi\) gives action \(a\) (solid circle)
  • Environment responds with reward \(r\) and next state \(s'\) according to probability \(p\)

Examples

Gridworld

Optimality

Optimal Policies

  • Value function defines a partial order for policies

\[ \pi \geq \pi' \quad \Leftrightarrow \quad v_\pi(s) \geq v_{\pi'}(s) \;\;\forall\;\;s \in \mathcal{S} \]

  • An optimal policy exists but may not be unique
  • Denote optimal policies by \(\pi_\ast\)
  • Denote optimal state-value function \(v_\ast = v_{\pi_\ast}\)

Optimal Policies

\[ \begin{align} v_\ast(s) &= \max_\pi v_\pi(s) \;\;\forall\;\;s\in\mathcal{S} \\ q_\ast(s,a) &= \max_\pi q_\pi(s,a) \;\;\forall\;\;s\in\mathcal{S} \;\;\text{and}\;\; a\in\mathcal{A}(s) \end{align} \]

  • The same policy optimizes both value functions

Relations

  • Q: What does the optimal policy \(\pi_\ast\) look like?
  • A: It puts all its weight on the best action (greed)

\[ \begin{gather} \pi_\ast(a|s) = \mathbb{1}_{a=a_\ast} = \begin{cases} 1, & a=a_\ast \\ 0, & \text{otherwise} \end{cases}, \qquad a_\ast = \underset{a\in\mathcal{A}(s)}{\arg\max}\; q_\ast(s,a) \\[1em] v_\ast(s) = \sum_{a\in\mathcal{A}(s)} \pi_\ast(a|s) q_\ast(s,a) = \max_{a\in\mathcal{A}(s)} q_\ast(s,a) \end{gather} \]

Recursion for Optimal Value Function

Leveraging this derivation and \(\pi=\pi_\ast\), we have

\[ \begin{align} v_\ast(s) &= \max_a q_\ast(s,a) \\ &= \max_a E[R_{t+1} + \gamma G_{t+1} | S_t=s,A_t=a] \\ &= \max_a E[R_{t+1} + \gamma v_\ast(S_{t+1}) | S_t=s, A_t=a] \\ &= \max_a \sum_{s',r} p(s',r|s,a) [r + \gamma v_\ast(s')] \end{align} \]

Apply Greed

\[ \begin{align} v_\pi(s) &= \sum_a \pi(a|s) \sum_{s',r} \left[ r + \gamma v_\pi(s') \right]p(s',r|s,a) \\ v_\ast(s) &= \max_a \sum_{s',r} \left[ r + \gamma v_\ast(s') \right]p(s',r|s,a) \end{align} \]

\[ \sum_a \pi_\ast(a|s) \quad \overset{\text{greed}}{\longrightarrow} \quad \max_a \]
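Iterating this greedy Bellman equation as an update gives value iteration, a standard way to approximate \(v_\ast\) when the dynamics are known. Same assumed data layout as the earlier sketches; this is an illustration, not the chapter's solution method (those come later).

```python
def value_iteration(states, actions, p, gamma, tol=1e-8):
    """Iterate the Bellman optimality equation for v_* to convergence."""
    v = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            # max over actions of the expected one-step backup
            new_v = max(
                (sum(prob * (r + gamma * v[s2]) for (s2, r, prob) in p[(s, a)])
                 for a in actions(s)),
                default=0.0,   # terminal states keep value 0
            )
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < tol:
            return v
```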

Recursion for Optimal Action-Value Function

Applying greed, we have

\[ \begin{align} q_\pi(s,a) &= \sum_{s',r} p(s',r|s,a) \left[ r + \gamma \sum_{a'} \pi(a'|s') q_\pi(s',a') \right] \\ q_\ast(s,a) &= \sum_{s',r} p(s',r|s,a)\left[ r + \gamma \max_{a'} q_\ast(s',a') \right] \end{align} \]

Backup Diagrams for Optimal Value Functions

Optimal Policy for Gridworld

(We don’t yet know how to find \(v_\ast\). Take this as given for now.)

Optimal Policy \(\leftarrow\) Optimal Value

  • Given \(v_\ast(s)\), \(\pi_\ast(a|s)\) assigns non-zero probability to best action(s) and zero probability to all other actions
  • Optimal policy \(\pi_\ast\) is greedy wrt optimal value \(v_\ast\)
  • A one-step (short-run) search is long-run optimal
  • \(v_\ast\) “takes into account the reward consequences of all possible future behavior”
  • Expected long-term return is encoded into \(v_\ast\)
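A sketch of that one-step greedy lookahead, assuming \(v_\ast\) is given (e.g., from the value-iteration sketch above) along with the same dynamics table; the helper names are invented.

```python
def greedy_policy_from_v(states, actions, p, v, gamma):
    """pi_*(s): the action whose one-step lookahead value under v_* is largest."""
    def backup(s, a):
        return sum(prob * (r + gamma * v[s2]) for (s2, r, prob) in p[(s, a)])
    return {s: max(actions(s), key=lambda a: backup(s, a))
            for s in states if actions(s)}    # skip terminal states
```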

Optimal Policy \(\leftarrow\) Optimal Value

  • Same for \(q_\ast(s,a)\) … in any state \(s\), choose the best action \(a\)
  • Costs more to store \(q_\ast(s,a)\) than \(v_\ast(s)\)
  • \(q_\ast\) encodes best action(s) without needing to know values of next states or environment dynamics \(p(s',r|s,a)\)
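With \(q_\ast\) stored as a nested dictionary `q[s][a]` (an assumed layout), action selection reduces to a lookup:

```python
def greedy_action(q, s):
    """Best action in state s directly from q_*: no model, no next-state values."""
    return max(q[s], key=q[s].get)
```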

Bellman’s Equations in Practice

\[ \begin{align} v_\ast(s) &= \max_a \sum_{s',r} \left[ r + \gamma v_\ast(s') \right]p(s',r|s,a) \\ q_\ast(s,a) &= \sum_{s',r} p(s',r|s,a)\left[ r + \gamma \max_{a'} q_\ast(s',a') \right] \end{align} \]

  • Usually don’t know \(p(s',r|s,a)\)
  • Usually can’t compute (sum and max too big)
  • State may not be exactly Markov (we assume it is)