Reinforcement Learning
Introduction #
Reinforcement Learning (RL) is an example of online planning: the agent has no prior knowledge of the rewards or transitions of its MDP, so it must explore the environment and learn from the feedback it receives before it can act on an estimated policy.
- Model-based learning: estimates the transition and reward functions from samples gathered during exploration, then solves the estimated MDP with value or policy iteration
- Model-free learning: estimates the values/Q-values of states directly, without constructing a transition or reward model of the MDP
Passive reinforcement learning: agent is given a policy and learns the values of states under that policy.
- direct evaluation, TD learning
Active reinforcement learning: use feedback to iteratively update policy, eventually learning optimal policy
- Q-learning
Direct Evaluation #
- Fix a policy $\pi$
- Have the agent experience several episodes (group of samples)
- Compute the estimated value of a state $s$ by dividing the total utility obtained from $s$ (the rewards received on leaving $s$ and onward, summed over every visit in every episode) by the number of times $s$ was visited
Direct evaluation loses all transition information, since each state's value is computed separately.
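Below is a minimal sketch of direct evaluation, assuming each episode is given as a list of (state, reward) pairs in the order experienced, where the reward is the one received on leaving that state; the function and variable names are illustrative, not from the notes.

```python
from collections import defaultdict

def direct_evaluation(episodes, gamma=1.0):
    """Estimate V^pi(s) as the average return observed from s across all episodes."""
    total_return = defaultdict(float)  # sum of returns observed from each state
    visits = defaultdict(int)          # number of times each state was visited

    for episode in episodes:
        G = 0.0
        # Work backwards so the (discounted) return from each state accumulates easily.
        for state, reward in reversed(episode):
            G = reward + gamma * G
            total_return[state] += G
            visits[state] += 1

    return {s: total_return[s] / visits[s] for s in visits}
```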
Temporal Difference (TD) Learning #
TD learning maintains an estimated value $V^\pi(s)$ for each state, updating it after every transition with an exponential moving average of one-step samples:
$$ V^\pi(s) \leftarrow (1-\alpha)V^\pi(s) + \alpha[R(s, \pi(s), s') + \gamma V^\pi(s')] $$
- $\alpha$ is a learning rate: higher $\alpha$ should be used when the current estimate is bad, and low $\alpha$ should be used when the current estimate is good.
- $\pi(s)$ is the action taken in $s$ under the fixed policy $\pi$. This action leads to a new state $s'$, whose current value estimate is used in the sample that updates the value of $s$.
- $V^\pi(s)$ converges to the value of the policy being followed; it equals the optimal value $V^*(s)$ only if that policy is already optimal. TD learning by itself cannot learn the optimal policy, since extracting a better policy from state values requires the transition and reward models.
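A minimal sketch of the TD update above, assuming values are stored in a plain dict keyed by state (the learning rate and discount shown are placeholders):

```python
def td_update(V, s, reward, s_next, alpha=0.1, gamma=0.9):
    """Apply one temporal-difference update after observing the transition (s, pi(s), s', r)."""
    sample = reward + gamma * V.get(s_next, 0.0)          # one-step bootstrapped sample
    V[s] = (1 - alpha) * V.get(s, 0.0) + alpha * sample   # exponential moving average
    return V
```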
Q-Learning #
Q-learning is like TD learning, but it learns Q-values of (state, action) pairs rather than state values.
Regardless of the optimality of the policy used or actions taken, Q-learning will still converge to the optimal policy given a reasonable learning rate and enough exploration.
$$ Q(s, a) \leftarrow (1-\alpha)Q(s,a) + \alpha[R(s, a, s') + \gamma \max_{a'} Q(s', a')] $$
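A minimal sketch of the tabular Q-learning update, assuming Q is a dict keyed by (state, action) and that a helper actions(s) enumerating the legal actions in s is available (both are assumptions for illustration):

```python
def q_learning_update(Q, s, a, reward, s_next, actions, alpha=0.1, gamma=0.9):
    """Apply one Q-learning update after observing the transition (s, a, s', r)."""
    # Bootstrap off the best Q-value in the successor state (0 if s' has no actions).
    best_next = max((Q.get((s_next, a2), 0.0) for a2 in actions(s_next)), default=0.0)
    sample = reward + gamma * best_next
    Q[(s, a)] = (1 - alpha) * Q.get((s, a), 0.0) + alpha * sample
    return Q
```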
Approximate Q-learning generalizes across states by representing Q-states with feature vectors and learning one weight per feature:
- Define $Q(s, a) = w \cdot f(s, a)$ (and, for state values, $V(s) = w \cdot f(s)$)
- Let $\textnormal{difference} = [R(s, a, s') + \gamma \max_{a'} Q(s', a')] - Q(s, a)$
- Update weights using $w_i \leftarrow w_i + \alpha \times \textnormal{difference} \times f_i(s, a)$
- For comparison, the exact (tabular) Q-learning update can be written as $Q(s, a) \leftarrow Q(s, a) + \alpha \cdot \textnormal{difference}$
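A minimal sketch of the approximate Q-learning weight update, assuming a feature function f(s, a) that returns a dict mapping feature names to values, plus the same hypothetical actions(s) helper as above:

```python
def approx_q_value(w, f, s, a):
    """Q(s, a) = w . f(s, a), with weights and features stored as dicts."""
    return sum(w.get(name, 0.0) * value for name, value in f(s, a).items())

def approx_q_update(w, f, s, a, reward, s_next, actions, alpha=0.05, gamma=0.9):
    """Shift each weight in the direction that reduces the TD error (the 'difference')."""
    best_next = max((approx_q_value(w, f, s_next, a2) for a2 in actions(s_next)),
                    default=0.0)
    difference = (reward + gamma * best_next) - approx_q_value(w, f, s, a)
    for name, value in f(s, a).items():
        w[name] = w.get(name, 0.0) + alpha * difference * value
    return w
```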
SARSA variation: instead of maximizing over actions in $s'$, use the action $a'$ actually taken by the current policy in $s'$ to compute the sample. $$Q(s,a) \leftarrow (1-\alpha)Q(s,a) + \alpha[R(s,a,s') + \gamma Q(s', a')]$$
- converges only if $\alpha$ approaches $0$ over time and the trajectories visit all states (and the actions available in them) sufficiently often
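A minimal sketch of the SARSA update under the same dict-keyed Q-table assumption; the key difference from Q-learning is that the bootstrapped value uses the action a_next the policy actually takes in s':

```python
def sarsa_update(Q, s, a, reward, s_next, a_next, alpha=0.1, gamma=0.9):
    """On-policy update: bootstrap off Q(s', a') for the action actually taken in s'."""
    sample = reward + gamma * Q.get((s_next, a_next), 0.0)
    Q[(s, a)] = (1 - alpha) * Q.get((s, a), 0.0) + alpha * sample
    return Q
```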
Exploration vs Exploitation #
The following methods balance time spent exploring against time spent exploiting the current estimates:
Epsilon-Greedy #
Define some probability $\epsilon \in [0, 1]$. With probability $\epsilon$, act randomly (explore). With probability $1 - \epsilon$, act greedily with respect to the current Q-values, i.e. follow the current best policy (exploit).
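A minimal sketch of epsilon-greedy action selection over a dict-keyed Q-table, again assuming a hypothetical actions(s) helper:

```python
import random

def epsilon_greedy(Q, s, actions, epsilon=0.1):
    """With probability epsilon act randomly; otherwise act greedily on current Q-values."""
    legal = list(actions(s))
    if random.random() < epsilon:
        return random.choice(legal)                       # explore
    return max(legal, key=lambda a: Q.get((s, a), 0.0))   # exploit
```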
Exploration Function #
During the Q-learning update, maximize over some exploration function instead of Q-values:
$$ Q(s, a) \leftarrow (1-\alpha)Q(s,a) + \alpha[R(s,a,s') + \gamma \max_{a'} f(s', a')] $$
where $f(s, a) = Q(s, a) + \frac{k}{N(s, a)}$
- $k$ is a constant determining the degree of exploration; because the bonus $k / N(s, a)$ shrinks as a Q-state is visited more often, exploration naturally tapers off over time.
- $N(s, a)$ is the number of times the Q-state has been visited.
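A minimal sketch of Q-learning driven by this exploration function, with Q and the visit counts N stored as dicts keyed by (state, action); the +1 in the bonus is an assumption added only to avoid dividing by zero for unvisited Q-states:

```python
def explore_fn(Q, N, s, a, k=1.0):
    """f(s, a) = Q(s, a) + k / N(s, a), with +1 to keep the bonus finite when N is 0."""
    return Q.get((s, a), 0.0) + k / (N.get((s, a), 0) + 1)

def q_update_with_exploration(Q, N, s, a, reward, s_next, actions,
                              alpha=0.1, gamma=0.9, k=1.0):
    """Q-learning update that maximizes the exploration function over successor actions."""
    N[(s, a)] = N.get((s, a), 0) + 1
    best_next = max((explore_fn(Q, N, s_next, a2, k) for a2 in actions(s_next)),
                    default=0.0)
    Q[(s, a)] = (1 - alpha) * Q.get((s, a), 0.0) + alpha * (reward + gamma * best_next)
    return Q, N
```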