Objective: Determine temperature and stirring rate for bioreactor
State: thermocouple and other sensor readings, the ingredients in the vat, and the target chemical
Actions: target temperature and target stirring rate (passed to a lower-level controller)
Reward: rate of target chemical production
States & actions: vectors
Reward: scalar value
Example: Pick-and-Place Robot
Objective: Control motion of robotic arm in repetitive task (fast smooth motion)
State: Positions and velocities of linkages
Action: Motor voltages
Reward: +1 (for each pick-and-place success) - jerkiness of motion
Discuss these Examples
Stop smoking
Play Pac-Man
Invest wisely
Driving to a destination (move the agent-environment boundary around)
Look at Example 3.3
Action set that depends on the state
The states defined in the transition graph satisfy the Markov property
Dynamics are tabulated (finite MDP)
Mathematical Formulation
Finite MDP
$\mathcal{S}$, $\mathcal{A}$, and $\mathcal{R}$ are finite sets
The RVs $S_t$ and $R_t$ have well-defined PMFs
MDP dynamics: $p(s', r \mid s, a) \doteq \Pr\{S_t = s', R_t = r \mid S_{t-1} = s, A_{t-1} = a\}$
Markov Property
Not a restriction on the decision process, but on the state
Markovity is in the state
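A minimal sketch of how the tabulated dynamics $p(s', r \mid s, a)$ of a finite MDP can be stored and sanity-checked; the toy two-state MDP below is hypothetical, not Example 3.3:

```python
# Hypothetical two-state, two-action finite MDP.
# dynamics[(s, a)] maps (next_state, reward) -> probability, i.e. p(s', r | s, a).
dynamics = {
    ("s0", "stay"): {("s0", 0.0): 1.0},
    ("s0", "go"):   {("s1", 1.0): 0.8, ("s0", 0.0): 0.2},
    ("s1", "stay"): {("s1", 0.0): 1.0},
    ("s1", "go"):   {("s0", 2.0): 0.5, ("s1", -1.0): 0.5},
}

# Each conditional PMF must sum to 1 over all (s', r) pairs.
for (s, a), pmf in dynamics.items():
    total = sum(pmf.values())
    assert abs(total - 1.0) < 1e-12, f"p(.|{s},{a}) sums to {total}"
```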
Goals and Rewards
Goals
Maximize total amount of reward received
Don’t focus on immediate (short run) reward
Focus on cumulative reward in the long run
Rewards
Incentivize what you want the agent to learn
If you want the agent to learn to do something for us, then "provide rewards to it in such a way that in maximizing them the agent will also achieve our goals."
Rewards communicate what to achieve, not how to achieve it
Chess
Reward winning (what)
Don’t reward subgoals (how)
Taking opponent’s pieces (how)
Controlling center of board (how)
+10 for win (reward winning the game)
-1 for each turn (reward winning the game in fewer moves)
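A minimal sketch of this reward scheme in code (the function name and arguments are illustrative assumptions):

```python
def chess_reward(game_over: bool, agent_won: bool) -> float:
    """Reward the outcome (what), not subgoals such as captured pieces or
    board control (how): +10 for a win, -1 for every turn played."""
    reward = -1.0           # each turn costs 1, so shorter wins score higher
    if game_over and agent_won:
        reward += 10.0      # the win bonus itself
    return reward
```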
Examples
Do these reward the right thing?
Robotic walking - make reward proportional to forward motion on each time step
Escape maze - reward is -1 for every time step
Returns and Episodes
Returns
Choose $A_t$ to maximize the expected return
Return is the cumulative reward received after time $t$
Return: $G_t \doteq R_{t+1} + R_{t+2} + R_{t+3} + \cdots + R_T$
The return could be some other function of the reward sequence
Note: Action $A_t$ cannot maximize the random variable $G_t$ itself; only the expected return can be maximized
Episodes
An episode is one repetition of the agent-environment interaction
Starting state is known or drawn from a distribution
Ending in a terminal state at time $T$
$T$ is an RV
Episodic tasks (games, mazes, etc.)
Nonepisodic tasks continue without end
Process control task
Robot with long life span
Discounting
Choose $A_t$ to maximize the expected discounted return $G_t \doteq R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$
$\gamma$, with $0 \le \gamma \le 1$, is the discount rate
Discounting puts diminishing weight on future rewards
$\gamma = 0$ (myopic agent): focuses attention only on the immediate reward
$\gamma \to 1$ (farsighted agent): takes future rewards into account more strongly
Recursion: $G_t = R_{t+1} + \gamma G_{t+1}$
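A minimal sketch computing the discounted return both directly from the definition and via the recursion $G_t = R_{t+1} + \gamma G_{t+1}$; the reward sequence is made up for illustration:

```python
def discounted_return(rewards, gamma):
    """G_t = sum_k gamma^k * R_{t+k+1}, computed directly from the definition."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

def returns_by_recursion(rewards, gamma):
    """Compute G_t for every t of an episode using G_t = R_{t+1} + gamma * G_{t+1}."""
    G = 0.0
    returns = []
    for r in reversed(rewards):      # sweep backward from the end of the episode
        G = r + gamma * G
        returns.append(G)
    return list(reversed(returns))

rewards = [1.0, 0.0, -2.0, 3.0]      # hypothetical rewards R_1, ..., R_4
gamma = 0.9
assert abs(discounted_return(rewards, gamma) - returns_by_recursion(rewards, gamma)[0]) < 1e-12
```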
Example: Cart-Pole
Objective: Balance the pole and stay on track
Initialize: Start with pole balanced in track center
Episodic: $R_t = +1$ for every time step on which failure has not occurred ($\gamma = 1$)
Q: What is $G_t$?
A: The number of steps until failure
This keeps the pole balanced as long as possible (maximizing $G_t$ maximizes the number of steps until failure)
Example: Cart-Pole
Objective: Balance the pole and stay on track
Initialize: Start with pole balanced in track center
Continuing: $R_t = -1$ for failure and $0$ otherwise (with $\gamma < 1$)
Return: $G_t$ is related to $-\gamma^{K-1}$, where $K$ is the number of steps from $t$ until failure
This keeps pole balanced as long as possible
Example: Cart-Pole
Objective: Balance the pole and stay on track
Initialize: Start with pole balanced in track center
Episodic: $R_t = -1$ for failure and $0$ otherwise (with $\gamma < 1$)
Return: $G_t = -\gamma^{T-t-1}$, where $T$ is the time of failure
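A small numerical check, under assumed values, that both formulations prefer longer balancing: the undiscounted episodic return grows with the number of steps until failure, while the discounted return $-\gamma^{K-1}$ rises toward 0 as failure is pushed further away:

```python
gamma = 0.95

for K in (5, 50, 500):                     # hypothetical numbers of steps until failure
    episodic_return = K                    # +1 per step before failure, no discounting
    discounted_return = -gamma ** (K - 1)  # single -1 at the failure step, gamma < 1
    print(f"K={K:4d}  episodic G={episodic_return:6d}  discounted G={discounted_return:.6f}")
```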
Maze escape robot
$R_t = +1$ for escape and $0$ otherwise. Why doesn't this work?
How can we communicate through the reward what we want the robot to do?
Unified Notation
Task Types
Episodic: $S_{t,i}$ for the state at time $t$ of episode $i$
Convention: Drop the episode index $i$ most of the time (write $S_t$ rather than $S_{t,i}$)
Continuing: $S_t$
Same notation for both task types
Return
Episodic: $G_t \doteq R_{t+1} + R_{t+2} + \cdots + R_T$
Continuing: $G_t \doteq \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$
Want same notation for both task types
To unify notation, define an absorbing state (drawn as a square) for episodic tasks, from which all subsequent rewards are zero
Unified Notation
Unified return: $G_t \doteq \sum_{k=t+1}^{T} \gamma^{k-t-1} R_k$
Continuing: $T = \infty$ and $\gamma < 1$ (need convergence)
Episodic: $T < \infty$ and $\gamma \le 1$ (convergence guaranteed)
Episode numbers not needed
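A sketch of the absorbing-state trick: pad the episode with zero rewards from the absorbing state, and the continuing-task return formula recovers the finite episodic sum (names and values are illustrative):

```python
def unified_return(rewards, gamma, horizon=1000):
    """Pad the episodic reward sequence with zeros (the absorbing state) and
    apply the continuing-task return formula; the padded terms add nothing."""
    padded = list(rewards) + [0.0] * (horizon - len(rewards))
    return sum(gamma ** k * r for k, r in enumerate(padded))

episode_rewards = [0.0, 0.0, 1.0]        # hypothetical 3-step episode
assert abs(unified_return(episode_rewards, gamma=1.0) - sum(episode_rewards)) < 1e-12
```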
Policies
Policy
Policy $\pi(a \mid s)$: how the agent chooses actions in the context of a state (the probability of taking action $a$ in state $s$)
RL methods: How $\pi$ changes based on experience
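A minimal sketch of a stochastic policy stored as a table $\pi(a \mid s)$ and sampled to choose actions; the states, actions, and probabilities are made up:

```python
import random

# pi[s][a] = probability of taking action a in state s (hypothetical values).
pi = {
    "s0": {"left": 0.7, "right": 0.3},
    "s1": {"left": 0.1, "right": 0.9},
}

def sample_action(state, policy=pi):
    """Draw an action according to pi(. | state)."""
    actions, probs = zip(*policy[state].items())
    return random.choices(actions, weights=probs, k=1)[0]

print(sample_action("s0"))
```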
Value Functions
State-Value Function for Policy $\pi$
$v_\pi(s) \doteq \mathbb{E}_\pi[G_t \mid S_t = s]$: expected return (value) when starting in state $s$ and following policy $\pi$ thereafter
Action-Value Function for Policy $\pi$
$q_\pi(s, a) \doteq \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]$: expected return (value) of taking action $a$ in state $s$ and following policy $\pi$ thereafter
Look Ahead (Monte Carlo Methods)
Expectations in $v_\pi$ and $q_\pi$ can be estimated from sample averages over many episodes
How? For each state $s$, maintain the average of the returns that followed that state under policy $\pi$. This average converges to $v_\pi(s)$.
How? For each state-action pair $(s, a)$, maintain the average of the returns that followed it under policy $\pi$. This average converges to $q_\pi(s, a)$.
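A sketch of the every-visit Monte Carlo averaging described above for $v_\pi$; the episode format and the way episodes are generated under $\pi$ are assumptions, not specified in the notes:

```python
from collections import defaultdict

def mc_state_values(episodes, gamma):
    """Estimate v_pi(s) by averaging the returns that followed each visit to s.

    `episodes` is an iterable of trajectories, each a list of (state, reward)
    pairs (S_t, R_{t+1}) collected while following policy pi (assumed given).
    """
    total = defaultdict(float)
    count = defaultdict(int)
    for episode in episodes:
        G = 0.0
        # Sweep backward so G accumulates the return that followed each state.
        for state, reward in reversed(episode):
            G = reward + gamma * G
            total[state] += G
            count[state] += 1
    return {s: total[s] / count[s] for s in total}
```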
Look Ahead (Approximation Methods)
If there are too many states, keeping a separate average for each state is not practical.
Approximate $v_\pi$ and $q_\pi$ as parameterized functions (e.g., a DNN) and adjust the parameters to match the observed returns.
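A minimal sketch of this idea, assuming a simple linear value function instead of a DNN: adjust the weights by gradient steps so that predictions move toward the observed returns.

```python
import numpy as np

def fit_value_function(samples, num_features, alpha=0.01, epochs=50):
    """samples: list of (feature_vector, observed_return) pairs gathered under pi.
    Learns weights w so that w . x(s) approximates v_pi(s)."""
    w = np.zeros(num_features)
    for _ in range(epochs):
        for x, G in samples:
            prediction = w @ x
            # Gradient step on the squared error (G - w.x)^2 with respect to w.
            w += alpha * (G - prediction) * x
    return w

# Hypothetical usage: two features, a handful of observed returns.
samples = [(np.array([1.0, 0.0]), 3.0), (np.array([0.0, 1.0]), -1.0)]
w = fit_value_function(samples, num_features=2)
```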