Jake Gunther
2019/12/13
Evaluative feedback: tells how good an action was; depends entirely on the action taken.
Instructive feedback: tells what action to take; independent of the action taken (this is supervised learning).
Associative setting: choosing different actions in different situations.
Nonassociative setting: learning to act in only one situation.
We are going to study evaluative RL in the nonassociative setting.
for i=1, 2, ..., N
(1) choose an action from {A_1, A_2, ..., A_k}
(2) draw a reward from PDF(action)
Pull the best arm (action).
The reward is random (governed by a PDF).
Each arm (action) has a different reward PDF.
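A minimal sketch of this loop in Python; the arm count and the Gaussian reward distributions are illustrative assumptions, not part of the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

k = 10                                   # number of arms (assumed for illustration)
q_star = rng.normal(0.0, 1.0, size=k)    # true value of each arm (assumed Gaussian bandit)

def pull(arm):
    """Draw a reward from this arm's PDF, assumed here to be N(q_star[arm], 1)."""
    return rng.normal(q_star[arm], 1.0)

N = 1000
for i in range(N):
    action = rng.integers(k)             # (1) choose an action (uniformly at random, for now)
    reward = pull(action)                # (2) draw a reward from PDF(action)
```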
\[ 1_\text{predicate} = \begin{cases} 1, & \text{predicate is true} \\ 0, & \text{predicate is false} \end{cases} \]
\[ 1_{A_i = a} = \begin{cases} 1, & A_i = a \\ 0, & A_i \neq a \end{cases} \]
\[ \begin{align} q_\ast(a) &= E[R_t | A_t = a] \\[1em] Q_t(a) &= \frac{\displaystyle \sum_{i=1}^{t-1} R_i 1_{A_i=a}} {\displaystyle \sum_{i=1}^{t-1} 1_{A_i=a}} \end{align} \]
Convergence:
\(Q_t(a) \rightarrow q_\ast(a)\)
as \(\displaystyle \sum_{i=1}^{t-1} 1_{A_i=a} \rightarrow \infty\)
Convergence speed?
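A quick empirical look at this estimator and its convergence: a sketch assuming a single arm with Gaussian rewards (the value 0.7 and the noise level are illustrative), pulling it repeatedly and watching \(Q_t(a)\) approach \(q_\ast(a)\).

```python
import numpy as np

rng = np.random.default_rng(1)
q_star_a = 0.7                                   # assumed true value q_*(a) of one arm
rewards = rng.normal(q_star_a, 1.0, size=10000)  # R_i drawn from that arm's PDF

# sample-average estimate Q_t(a) after each pull of this arm
Q = np.cumsum(rewards) / np.arange(1, rewards.size + 1)
print(Q[9], Q[99], Q[9999])                      # estimates tighten around q_star_a as pulls accumulate
```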
How would you exploit knowledge of the value function \(q_\ast(a) = E[R_t |A_t=a]\) to maximize total expected rewards in \(N\) trials?
Wouldn’t you always choose the maximum? \[ A_t = \arg \underset{a}{\max} Q_t(a) \]
This is called a greedy policy
\[ A_t = \arg \underset{a}{\max} Q_t(a) \]
Short run (few time steps available)
Long run (many time steps available)
\[ A_t = \begin{cases} \arg \underset{a}{\max} Q_t(a), & \text{with prob} \;1-\varepsilon \\ a_i, \;i \sim U[1, 2, \cdots, k], & \text{with prob} \;\varepsilon \end{cases} \]
\[ \hat{x} = \arg \max_x f(x) \]
\[ \underset{\begin{array}{c}\text{old} \\ \text{knowledge}\end{array}}{Q_t}\overset{\begin{array}{c}\text{policy:} \\ \text{exploit} \\ \text{explore} \end{array}}{\longrightarrow} \underset{\text{action}}{A_t} \overset{\begin{array}{c} \text{bandit or} \\ \text{environment}\end{array}}{\longrightarrow} \underset{\text{reward}}{R_t} \overset{\begin{array}{c} \text{learning} \\ \text{update} \end{array}}{\longrightarrow} \underset{\begin{array}{c}\text{new} \\ \text{knowledge}\end{array}}{Q_{t+1}} \]
Put these together and you have an algorithm for RL
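Putting the \(\varepsilon\)-greedy policy, the bandit, and a sample-average learning update together gives a complete algorithm. A minimal sketch, again assuming \(k\) Gaussian arms; the update uses the incremental form of the sample average derived below:

```python
import numpy as np

def run_bandit(k=10, N=1000, eps=0.1, seed=0):
    rng = np.random.default_rng(seed)
    q_star = rng.normal(0.0, 1.0, size=k)   # true arm values (assumed Gaussian bandit)
    Q = np.zeros(k)                          # value estimates Q_t(a)
    n = np.zeros(k)                          # pull counts per arm

    total = 0.0
    for t in range(N):
        # policy: exploit with prob 1 - eps, explore with prob eps
        if rng.random() < eps:
            a = rng.integers(k)
        else:
            a = int(np.argmax(Q))
        # bandit/environment: draw reward from PDF(a), assumed N(q_star[a], 1)
        R = rng.normal(q_star[a], 1.0)
        # learning update: NewEst = OldEst + StepSize * (Target - OldEst)
        n[a] += 1
        Q[a] += (R - Q[a]) / n[a]
        total += R
    return total / N

print(run_bandit())   # average reward per step
```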
\[ Q_n = \frac{R_1 + R_2 + \cdots + R_{n-1}}{n-1} \]
\[ \begin{align} Q_{n+1} &= \frac{1}{n} \sum_{i=1}^n R_i = \frac{1}{n} \left( R_n + \sum_{i=1}^{n-1} R_i \right) \\ &= \frac{1}{n} \left( R_n + \frac{n-1}{n-1}\sum_{i=1}^{n-1} R_i \right) \\ &= \frac{1}{n} \left( R_n + (n-1) Q_n \right) \\ &= \frac{1}{n} \left( R_n + n Q_n - Q_n\right) \\ &= Q_n + \frac{1}{n} \left[ R_n - Q_n \right] \end{align} \]
\[ \text{NewEst} \leftarrow \text{OldEst} + \text{StepSize}\left[\text{Target} - \text{OldEst}\right] \]
\[ Q_{n+1} = Q_n + \frac{1}{n} \left[ R_n - Q_n \right] \]
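A quick numeric check that the incremental rule reproduces the batch sample average; the reward sequence is arbitrary, just for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
rewards = rng.normal(0.5, 1.0, size=1000)   # arbitrary reward sequence

Q = 0.0
for n, R in enumerate(rewards, start=1):
    Q += (R - Q) / n                        # Q_{n+1} = Q_n + (1/n)[R_n - Q_n]

print(np.isclose(Q, rewards.mean()))        # True: same value as the batch average
```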
Stationary problem: reward PDFs fixed over time; the sample-average step size \(1/n\) works well.
Nonstationary problem: reward PDFs drift over time; use a constant step size \(\alpha\) to weight recent rewards more heavily.
\[ \begin{align} Q_{n+1} &= Q_n +\alpha\left[ R_n - Q_n\right] = \cdots \\ &=(1-\alpha)^n Q_1 + \sum_{i=1}^n \alpha(1-\alpha)^{n-i} R_i \end{align} \]
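A sketch verifying the closed form against the recursion; \(\alpha\), \(Q_1\), and the reward sequence are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(3)
alpha, Q1 = 0.1, 0.0
R = rng.normal(1.0, 1.0, size=50)                 # rewards R_1, ..., R_n

# recursion: Q_{n+1} = Q_n + alpha * (R_n - Q_n)
Q = Q1
for R_n in R:
    Q += alpha * (R_n - Q)

# closed form: (1 - alpha)^n Q_1 + sum_i alpha (1 - alpha)^(n - i) R_i
n = R.size
i = np.arange(1, n + 1)
closed = (1 - alpha) ** n * Q1 + np.sum(alpha * (1 - alpha) ** (n - i) * R)

print(np.isclose(Q, closed))                      # True: recursion matches closed form
```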
\[ H_{t+1}(a) = H_t(a) + \alpha \frac{\partial E[R_t]}{\partial H_t(a)} \quad \text{(gradient ascent)} \]
\[ \begin{align} H_{t+1}(A_t) &= H_t(A_t) + \alpha (R_t - \bar{R}_t)(1-\pi_t(A_t)) \\ H_{t+1}(a) &= H_t(a) - \alpha (R_t - \bar{R}_t)\pi_t(a), \quad a \neq A_t \end{align} \]
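A sketch of the resulting gradient bandit algorithm on the same assumed Gaussian bandit: softmax action probabilities \(\pi_t(a)\) from the preferences \(H_t(a)\), an incremental average-reward baseline \(\bar{R}_t\), and the two updates above.

```python
import numpy as np

def gradient_bandit(k=10, N=1000, alpha=0.1, seed=4):
    rng = np.random.default_rng(seed)
    q_star = rng.normal(0.0, 1.0, size=k)     # true arm values (assumed)
    H = np.zeros(k)                            # preferences H_t(a)
    R_bar = 0.0                                # baseline: average reward so far

    for t in range(1, N + 1):
        pi = np.exp(H - H.max())
        pi /= pi.sum()                         # softmax policy pi_t(a)
        a = rng.choice(k, p=pi)
        R = rng.normal(q_star[a], 1.0)
        R_bar += (R - R_bar) / t               # incremental average-reward baseline

        # gradient ascent on E[R_t] with respect to the preferences
        H -= alpha * (R - R_bar) * pi          # all arms: -alpha (R - R_bar) pi_t(a)
        H[a] += alpha * (R - R_bar)            # chosen arm: net update alpha (R - R_bar)(1 - pi_t(A_t))
    return H

print(np.argmax(gradient_bandit()))            # arm the learned preferences favor
```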