Chapter 2: Multi-Armed Bandits

Jake Gunther

2019/12/13

Evaluative vs. Instructive Learning

Pure Evaluation

Tells how good an action was

Depends entirely on the action taken

Pure Instruction

Tells what action to take

Independent of the action taken

This is supervised learning

Associative vs. Nonassociative Settings

Associative Setting

Learning to choose different actions in different situations

Nonassociative Setting

Learning to act in only one situation

Context for Chapter 2

We are going to study evaluative RL in a nonassociative setting.

\(k\)-armed bandits

What is a \(k\)-Armed Bandit?

for t = 1, 2, ..., N:
    (1) choose an action A_t from {a_1, a_2, ..., a_k} according to a policy
    (2) draw reward R_t from the reward PDF of the chosen action
  • Repeatedly choose from \(k\) actions according to some policy
  • Receive a reward drawn from a PDF that depends on the action
  • Objective: maximize the total expected reward over \(N\) actions
  • What policy does this?
  • What are good/bad policies?
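A minimal sketch of this loop in Python; the Gaussian reward PDFs and the uniform-random placeholder policy are illustrative assumptions:

    import numpy as np

    rng = np.random.default_rng(0)

    k, N = 10, 1000
    q_star = rng.normal(0.0, 1.0, size=k)       # true (unknown) action values, assumed here

    def reward_pdf(action):
        return rng.normal(q_star[action], 1.0)  # assumed reward PDF: N(q_star[a], 1)

    total = 0.0
    for t in range(N):
        A_t = rng.integers(k)                   # placeholder policy: uniform random choice
        total += reward_pdf(A_t)                # reward drawn from the PDF of the chosen action

    print(total / N)                            # a good policy would approach max(q_star)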

Example: Slot Machine

Multi-armed bandit. Pull one lever at a time.

Goal: pull the best arm (action).

The reward is random (governed by a PDF).

Each arm (action) has a different reward PDF.

Some Math

Definitions

  • \(A_t =\) action selected at time \(t\)
  • \(R_t =\) reward for action \(A_t\) at time \(t\)
  • \(q_\ast(a) = E[R_t | A_t = a] =\) value of action \(a\)
  • \(Q_t(a)\) is an estimate of \(q_\ast(a)\)
  • \(\text{Policy}: Q_t \mapsto A_t\) (maps value to action)

Indicator Function

\[ 1_\text{predicate} = \begin{cases} 1, & \text{predicate is true} \\ 0, & \text{predicate is false} \end{cases} \]

\[ 1_{A_i = a} = \begin{cases} 1, & A_i = a \\ 0, & A_i \neq a \end{cases} \]

  • If \(\text{predicate}\) involves an RV, then \(1_\text{predicate}\) is an RV, i.e., a function of an RV.

Value Estimates

Sample Average Estimate

\[ \begin{align} q_\ast(a) &= E[R_t | A_t = a] \\[1em] Q_t(a) &= \frac{\displaystyle \sum_{i=1}^{t-1} R_i 1_{A_i=a}} {\displaystyle \sum_{i=1}^{t-1} 1_{A_i=a}} \end{align} \]

Convergence:

\(Q_t(a) \rightarrow q_\ast(a)\)

as \(\displaystyle \sum_{i=1}^{t-1} 1_{A_i=a} \rightarrow \infty\)

by Law of Large Numbers.

Convergence speed?
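A sketch of the sample-average estimate computed from recorded histories; the `actions`/`rewards` arrays and the zero default are illustrative conventions:

    import numpy as np

    def sample_average(actions, rewards, a):
        # Q_t(a): mean reward over the time steps where action a was taken
        actions = np.asarray(actions)
        rewards = np.asarray(rewards)
        mask = (actions == a)                    # the indicator 1_{A_i = a}
        if mask.sum() == 0:
            return 0.0                           # assumed default when a has never been chosen
        return rewards[mask].sum() / mask.sum()

    print(sample_average([0, 1, 0, 2], [1.0, 0.5, 3.0, -1.0], 0))   # -> 2.0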

Policies

Policy if \(q_\ast(a)\) is known?

  • How would you exploit knowledge of the value function \(q_\ast(a) = E[R_t |A_t=a]\) to maximize total expected rewards in \(N\) trials?

  • Wouldn’t you always choose the maximum? \[ A_t = \arg \underset{a}{\max} q_\ast(a) \]

  • This is called a greedy policy

Policy if \(q_\ast(a)\) is unknown?

  • If \(q_\ast(a)\) is unknown, then use the estimate \(Q_t(a)\).

Greedy Policy

\[ A_t = \arg \underset{a}{\max} Q_t(a) \]

  • Choose action with best estimated value
  • Greed always chooses the action that currently looks best (it never explores)
  • If \(Q_t(a) = q_\ast(a)\), then greed is good (optimal)
  • If \(Q_t(a) \neq q_\ast(a)\), then greed may be bad (sub-optimal)

Notation

  • \(q_\ast(a)\) not a function of \(t\). Assumed to be constant.
  • \(Q_t(a)\) is a function of \(t\). This function changes with time. This is learning.
  • The greedy action \(A_t = \arg \underset{a}{\max} Q_t(a)\) may change with time too. This is learning.

Exploitation vs. Exploration

  • Greedy action exploits knowledge of \(Q_t(a)\)
  • Non-greedy actions explore other actions
  • If \(Q_t(a) \neq q_\ast(a)\), then exploration can find actions better than the current greedy action and improve the estimate \(Q_t(a)\).

Short Run vs. Long Run

  • Short run (few time steps available)

    • Exploitation is probably best in the short run
    • Greedy actions maximize immediate rewards
  • Long run (many time steps available)

    • Exploration can find better/best actions which can be exploited many times in the long run

\(\varepsilon\)-Greedy Policy

\[ A_t = \begin{cases} \arg \underset{a}{\max} Q_t(a), & \text{with prob} \;1-\varepsilon \\ a_i, \;i \sim U[1, 2, \cdots, k], & \text{with prob} \;\varepsilon \end{cases} \]

  • Greedy most of the time (exploitation)
  • Randomly explore non-greedy actions
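A minimal sketch of \(\varepsilon\)-greedy action selection; random tie-breaking among greedy actions is an assumed convention:

    import numpy as np

    rng = np.random.default_rng(1)

    def epsilon_greedy(Q, epsilon):
        # with prob epsilon: a uniformly random action; otherwise the greedy action
        if rng.random() < epsilon:
            return int(rng.integers(len(Q)))         # explore
        best = np.flatnonzero(Q == Q.max())          # all maximizers of Q_t(a)
        return int(rng.choice(best))                 # break ties at random

    A_t = epsilon_greedy(np.array([0.1, 0.5, 0.3]), epsilon=0.1)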

Learning Algorithm

Learning

  • Learning means updating the value estimate \(Q_t(a)\)
  • Selected actions evolve as \(Q_t(a)\) changes
  • Learning algorithms explore and exploit (need both, how much of each?)
    • A purely greedy policy only exploits and never explores (its estimates still adapt/evolve, however)
    • \(\varepsilon\)-Greedy policy learns

Learning vs. Optimization

Parameter Optimization

\[ \hat{x} = \arg \max_x f(x) \]

  • Choose parameter \(x\) to maximize an objective function \(f\)

Learning

  • Objective: maximize total expected reward
  • Policy: \(Q_t \mapsto A_t\) (take action)
  • Bandit: \(A_t \mapsto R_t\) (receive reward)
  • Update: \(Q_t \mapsto Q_{t+1}\) (update estimate)
  • Note: \(Q_t \underset{t\rightarrow \infty}{\longrightarrow} q_\ast\) is sufficient (what algorithms?)
  • Learning is different from parameter optimization

Learning

  • Move from what we know now (at time \(t\)) through interaction with the environment to new knowledge (at time \(t+1\))

\[ \underset{\begin{array}{c}\text{old} \\ \text{knowledge}\end{array}}{Q_t}\overset{\begin{array}{c}\text{policy:} \\ \text{exploit} \\ \text{explore} \end{array}}{\longrightarrow} \underset{\text{action}}{A_t} \overset{\begin{array}{c} \text{bandit or} \\ \text{environment}\end{array}}{\longrightarrow} \underset{\text{reward}}{R_t} \overset{\begin{array}{c} \text{learning} \\ \text{update} \end{array}}{\longrightarrow} \underset{\begin{array}{c}\text{new} \\ \text{knowledge}\end{array}}{Q_{t+1}} \]

Reinforcement Learning

Putting it all Together

  • Action space defined
  • Reward function
    • pdf known
    • simulation
    • real world
  • Value Estimator
    • initialization
    • update
  • Policy (maps value estimator to actions)

Put these together and you have an algorithm for RL

Testbed

Performance

  • Numerics from sample runs
  • Code (see the sketch below)
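A rough testbed sketch, assuming the usual setup of true values drawn from \(N(0,1)\) and rewards from \(N(q_\ast(a),1)\); the run counts and \(\varepsilon\) values are illustrative:

    import numpy as np

    rng = np.random.default_rng(2)

    def run(epsilon, k=10, steps=1000):
        q_star = rng.normal(0.0, 1.0, size=k)        # true values, assumed N(0, 1)
        Q, n = np.zeros(k), np.zeros(k)
        total = 0.0
        for _ in range(steps):
            a = rng.integers(k) if rng.random() < epsilon else int(np.argmax(Q))
            r = rng.normal(q_star[a], 1.0)           # reward assumed N(q_star[a], 1)
            n[a] += 1
            Q[a] += (r - Q[a]) / n[a]                # sample-average (incremental) update
            total += r
        return total / steps

    for eps in (0.0, 0.01, 0.1):                     # compare greedy vs. epsilon-greedy
        print(eps, np.mean([run(eps) for _ in range(100)]))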

Incremental Implementation

Calculation Trick

\[ Q_n = \frac{R_1 + R_2 + \cdots + R_{n-1}}{n-1} \]

  • \(R_n=\) reward received after the \(n\)-th selection of a particular action
  • \(Q_n=\) value estimate for that action after \(n-1\) selections

\[ \begin{align} Q_{n+1} &= \frac{1}{n} \sum_{i=1}^n R_i = \frac{1}{n} \left( R_n + \sum_{i=1}^{n-1} R_i \right) \\ &= \frac{1}{n} \left( R_n + (n-1)\,\frac{1}{n-1}\sum_{i=1}^{n-1} R_i \right) \\ &= \frac{1}{n} \left( R_n + (n-1) Q_n \right) \\ &= \frac{1}{n} \left( R_n + n Q_n - Q_n\right) \\ &= Q_n + \frac{1}{n} \left[ R_n - Q_n \right] \end{align} \]

Update Formula

\[ \text{NewEst} \leftarrow \text{OldEst} + \text{StepSize}\left[\text{Target} - \text{OldEst}\right] \]

\[ Q_{n+1} = Q_n + \frac{1}{n} \left[ R_n - Q_n \right] \]

  • \(\left[\text{Target} - \text{OldEst}\right]\) is the error in the estimate
  • Step is in a direction that reduces the error on average
  • The target is an RV
  • StepSize is time varying, denoted by \(\alpha_n(a)\)
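A quick numerical check of the incremental update against the batch mean (the values are arbitrary):

    import numpy as np

    rewards = np.array([2.0, 0.5, 1.5, 3.0])

    Q = 0.0
    for n, R in enumerate(rewards, start=1):
        Q = Q + (R - Q) / n          # Q_{n+1} = Q_n + (1/n)[R_n - Q_n]

    print(Q, rewards.mean())         # both print 1.75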

Simple Bandit Algorithm
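A minimal sketch of the simple bandit algorithm (\(\varepsilon\)-greedy with incremental sample-average updates); the `bandit` callable and the Gaussian rewards in the usage line are assumptions:

    import numpy as np

    rng = np.random.default_rng(3)

    def simple_bandit(bandit, k, steps, epsilon):
        Q = np.zeros(k)                              # Q(a) <- 0 for all a
        n = np.zeros(k)                              # N(a) <- 0 for all a
        for _ in range(steps):
            if rng.random() < epsilon:
                a = int(rng.integers(k))             # explore
            else:
                a = int(np.argmax(Q))                # exploit
            r = bandit(a)                            # R <- bandit(A)
            n[a] += 1                                # N(A) <- N(A) + 1
            Q[a] += (r - Q[a]) / n[a]                # Q(A) <- Q(A) + (1/N(A))[R - Q(A)]
        return Q

    # usage with an assumed Gaussian bandit
    q_star = rng.normal(0.0, 1.0, size=10)
    Q = simple_bandit(lambda a: rng.normal(q_star[a], 1.0), k=10, steps=1000, epsilon=0.1)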

Nonstationary Problems

Stationary

  • Reward PDF invariant
  • \(\alpha_n(a) = \frac{1}{n}\) gives convergence to \(q_\ast(a)\) when the reward PDF is stationary
  • Puts less and less weight on recent errors
  • Fails in nonstationary case

Nonstationary

  • Reward PDF changes over time
  • Real problems are often nonstationary
  • Want more weight on recent errors
  • \(\alpha_n(a) = \alpha\) (constant) gives continuous adaptation
  • Uncertainty \(\Rightarrow\) always keep exploring

Constant Step Size

\[ \begin{align} Q_{n+1} &= Q_n +\alpha\left[ R_n - Q_n\right] = \cdots \\ &=(1-\alpha)^n Q_1 + \sum_{i=1}^n \alpha(1-\alpha)^{n-i} R_i \end{align} \]

  • Exponential recency-weighted average
  • Eventually the initial condition is forgotten
  • More weight given to recently observed rewards
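A quick numerical check that the recursive and closed forms agree (the values are arbitrary):

    import numpy as np

    alpha, Q1 = 0.1, 0.0
    rewards = np.array([1.0, 2.0, -0.5, 3.0])
    n = len(rewards)

    Q = Q1                                           # recursive form
    for R in rewards:
        Q = Q + alpha * (R - Q)

    i = np.arange(1, n + 1)                          # closed form
    closed = (1 - alpha)**n * Q1 + np.sum(alpha * (1 - alpha)**(n - i) * rewards)

    print(Q, closed)                                 # identical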

Initial Values

Initial Values

  • \(Q_n(a)\) is biased by initial values \(Q_1(a)\)
  • Bias fades (but never fully disappears) for \(\alpha_n(a)=\alpha\)
  • Bias disappears for \(\alpha_n(a)=\frac{1}{n}\) after \(a\) has been selected once
  • Can set \(Q_1(a) = 0\) for all \(a\)
  • \(Q_1(a)\) are user selectable parameters

Optimistic Initial Values

  • Setting \(Q_1(a)\) unreasonably high forces an effect like exploration, even for pure greedy policies
  • Example: \(\alpha=0.1\) in recursive update for \(Q_n(a)\)
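A sketch of optimistic initialization with a purely greedy policy and constant \(\alpha = 0.1\); the testbed distributions and \(Q_1(a) = +5\) are illustrative assumptions:

    import numpy as np

    rng = np.random.default_rng(4)

    k, steps, alpha = 10, 1000, 0.1
    q_star = rng.normal(0.0, 1.0, size=k)
    Q = np.full(k, 5.0)                    # optimistic initial values Q_1(a) = +5 (assumed)
    for _ in range(steps):
        a = int(np.argmax(Q))              # purely greedy, yet it explores early on
        r = rng.normal(q_star[a], 1.0)
        Q[a] += alpha * (r - Q[a])         # each draw pulls the optimistic estimate down

    print(int(np.argmax(Q)), int(np.argmax(q_star)))   # often agree after enough steps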


Observations

  • Optimistic initialization is a trick to force early exploration
  • It works well in stationary environments, but not in nonstationary ones
  • Sample-average methods also treat the beginning of time as a special case

Preferences and Policy PDF

Definitions

  • \(H_t(a) =\) relative preference for action \(a\)
  • \(Pr\{A_t=a\} = \pi_t(a) = \frac{e^{H_t(a)}}{\sum_{b=1}^{k} e^{H_t(b)}} =\) policy PDF
  • Actions selected at random from \(\pi_t(a)\)
  • \(E[R_t] = E[ E[R_t |A_t] ] = E[q_\ast(A_t)]=\displaystyle \sum_a \pi_t(a) q_\ast(a)\)
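A sketch of the softmax policy PDF and the resulting expected reward; the preference and value numbers are illustrative:

    import numpy as np

    def softmax(H):
        z = np.exp(H - H.max())            # subtract max for numerical stability
        return z / z.sum()                 # pi_t(a)

    H = np.array([0.0, 1.0, -0.5])         # illustrative preferences H_t(a)
    q_star = np.array([0.2, 1.0, 0.4])     # illustrative true values

    pi = softmax(H)
    expected_reward = np.sum(pi * q_star)  # E[R_t] = sum_a pi_t(a) q_star(a)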

Policy Gradient

  • Homework: Show that gradient ascent leads to the updates below (see pg. 38+)

\[ H_{t+1}(a) = H_t(a) + \alpha \frac{\partial E[R_t]}{\partial H_t(a)} \quad \text{(gradient ascent)} \]

\[ \begin{align} H_{t+1}(A_t) &= H_t(A_t) + \alpha (R_t - \bar{R}_t)(1-\pi_t(A_t)) \\ H_{t+1}(a) &= H_t(a) - \alpha (R_t - \bar{R}_t)\pi_t(a), \quad a \neq A_t \end{align} \]
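A sketch of one gradient-bandit step implementing these updates, using the running average of rewards as the baseline \(\bar{R}_t\); the `bandit` callable and other names are illustrative:

    import numpy as np

    rng = np.random.default_rng(5)

    def gradient_bandit_step(H, R_bar, bandit, alpha, t):
        pi = np.exp(H - H.max())
        pi /= pi.sum()                                # pi_t(a), the softmax policy PDF
        A = rng.choice(len(H), p=pi)                  # sample A_t from pi_t
        R = bandit(A)                                 # observe R_t
        one_hot = np.zeros(len(H))
        one_hot[A] = 1.0
        H = H + alpha * (R - R_bar) * (one_hot - pi)  # both preference updates at once
        R_bar = R_bar + (R - R_bar) / t               # baseline: running average of rewards
        return H, R_bar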