As I read papers on reinforcement learning, there are key points I like to look for; I've listed some of them below, along with questions to prompt thinking.
Features are mappings from the state space or state-action space into a domain where they can be combined with model weights to approximate value functions or policies, e.g., linear features for value function approximation.
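To make this concrete, here is a minimal sketch of linear value-function approximation with hand-crafted features; the featurizer, sizes, and step sizes are my own illustrative choices, not taken from any particular paper:

```python
import numpy as np

NUM_FEATURES = 8
w = np.zeros(NUM_FEATURES)  # model weights

def tile_features(state):
    """Hypothetical hand-crafted featurizer: map a scalar state in [0, 1]
    onto a one-hot 'tile' encoding (purely illustrative)."""
    phi = np.zeros(NUM_FEATURES)
    idx = int(np.clip(state * NUM_FEATURES, 0, NUM_FEATURES - 1))
    phi[idx] = 1.0
    return phi

def v_hat(state):
    """Linear value-function approximation: v(s) ~= w . phi(s)."""
    return w @ tile_features(state)

def td0_update(s, r, s_next, alpha=0.1, gamma=0.99):
    """Semi-gradient TD(0) step for a single transition (s, r, s')."""
    td_error = r + gamma * v_hat(s_next) - v_hat(s)
    w[:] = w + alpha * td_error * tile_features(s)
```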
Describe the features used in the paper.
Were they hand-crafted by human experts?
Did they encode knowledge of the problem domain?
Were they learned using neural networks?
If neural networks are used, what architecture was chosen and how were the features trained? (A minimal sketch follows this list.)
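By contrast, when the features are learned, the hand-crafted featurizer above is replaced by a network whose hidden layer produces the features. A minimal sketch follows; the architecture and sizes are my own assumptions, kept in plain NumPy so it stays self-contained:

```python
import numpy as np

rng = np.random.default_rng(0)
STATE_DIM, NUM_FEATURES = 4, 16

# A tiny two-layer network: the hidden layer plays the role of learned
# features phi(s), and the output weights form the linear value head.
W1 = rng.normal(scale=0.1, size=(NUM_FEATURES, STATE_DIM))
b1 = np.zeros(NUM_FEATURES)
w2 = np.zeros(NUM_FEATURES)

def learned_features(state):
    return np.maximum(0.0, W1 @ state + b1)  # ReLU hidden layer = features

def v_hat(state):
    return w2 @ learned_features(state)

# In a full implementation the TD or policy-gradient loss would also be
# backpropagated into W1 and b1, so the features themselves are learned
# end-to-end rather than fixed in advance.
```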
Describe the learning algorithm used in the paper.
Describe the roots of the learning algorithm.
Is the algorithm on-policy or off-policy?
How does the algorithm balance exploration and exploitation? (A SARSA vs. Q-learning sketch with ε-greedy exploration follows this list.)
Describe the algorithm relative to the deadly triad (off-policy, bootstrapping, function approximation).
Does the algorithm learn a value function or a policy or both?
Where does the algorithm fall on the Monte Carlo vs. bootstrapping spectrum? (See the n-step return sketch after this list.)
Is the method applicable to episodic as well as continuing tasks?
What optimization methods were used: stochastic gradient descent, RMSProp, ADMM, etc.?
What is known about the bias, convergence, and asymptotic performance of the algorithm?
What advantages does the proposed algorithm have over previous methods?
What special considerations are made for learning/training?
What was learned through training and simulation?
How was the data used/reused in the training? (Experience replay seems to reappear in several papers; a minimal replay-buffer sketch follows this list.)
What tricks were used in learning to get the algorithm to converge, to reduce variance, and/or to accelerate convergence?
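To ground the on-policy/off-policy and exploration questions, here is a minimal tabular sketch (my own illustration, not taken from any specific paper): SARSA bootstraps from the action the behaviour policy actually takes, Q-learning bootstraps from the greedy action, and ε-greedy selection supplies the exploration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 10, 4
Q = np.zeros((n_states, n_actions))

def epsilon_greedy(s, epsilon=0.1):
    """Exploration/exploitation trade-off: act randomly with probability epsilon."""
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))
    return int(np.argmax(Q[s]))

def sarsa_update(s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """On-policy: the target uses the action the behaviour policy actually took."""
    target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (target - Q[s, a])

def q_learning_update(s, a, r, s_next, alpha=0.1, gamma=0.99):
    """Off-policy: the target uses the greedy action, regardless of behaviour."""
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
```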
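For the Monte Carlo vs. bootstrapping question, the difference lies in the update target; the n-step return below interpolates between the two extremes (again an illustrative sketch with hypothetical inputs):

```python
def mc_return(rewards, gamma=0.99):
    """Monte Carlo target: the full discounted return of an episode (no bootstrapping)."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

def n_step_target(rewards, v_bootstrap, gamma=0.99):
    """n-step target: n observed rewards, then bootstrap from V(s_{t+n}).
    With n = 1 this is the TD(0) target; as n grows it approaches Monte Carlo."""
    g = v_bootstrap
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Example: a 3-step target for rewards [1.0, 0.0, 0.5] and V(s_{t+3}) = 2.0
# print(n_step_target([1.0, 0.0, 0.5], 2.0))
```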
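And for the data-reuse question, experience replay in its simplest form is just a buffer of past transitions sampled uniformly at random for updates; a minimal sketch, with the surrounding training loop and update rule assumed rather than shown:

```python
import random
from collections import deque

class ReplayBuffer:
    """Minimal uniform experience replay: store transitions, sample mini-batches."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def add(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        return random.sample(self.buffer, batch_size)

# Usage: after each environment step, store the transition; periodically
# sample a batch and pass it to whatever update rule the paper uses.
# buffer.add(s, a, r, s_next, done)
# batch = buffer.sample()   # decorrelates and reuses experience
```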