Reinforcement Learning
Introduction #
Reinforcement Learning (RL) is an example of online planning: the agent has no prior knowledge of the reward or transition functions and must explore the environment, using the feedback it receives to estimate an optimal policy.
- Model-based learning: attempts to estimate the transition and reward functions from samples gathered during exploration, then solves the resulting MDP with value or policy iteration
- Model-free learning: attempts to estimate the values/Q-values of states directly, without constructing a transition or reward model of the MDP
Passive reinforcement learning: agent is given a policy and learns the values of states under that policy.
- direct evaluation, TD learning
Active reinforcement learning: agent uses the feedback from its actions to iteratively update its policy, eventually learning the optimal policy
- Q-learning
Direct Evaluation #
- Fix a policy
- Have the agent experience several episodes (each a sequence of samples) under that policy
- Compute the estimated value of each state by dividing the total utility obtained from that state onward (the reward collected from leaving the state until the end of the episode, summed over all visits) by the number of times the state was visited
Direct evaluation loses all transition information, since each state’s value is computed separately.
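As a concrete sketch, direct evaluation fits in a few lines of Python. The episode format (a list of `(state, reward)` pairs per episode, with the reward received on leaving the state) and the function name are assumptions made for illustration:

```python
from collections import defaultdict

def direct_evaluation(episodes, gamma=1.0):
    """Estimate V(s) under a fixed policy by averaging the discounted
    return observed from each visit to s.

    `episodes` is a list of episodes, each a list of (state, reward)
    pairs in time order, where `reward` is received on leaving `state`.
    """
    total_return = defaultdict(float)
    visits = defaultdict(int)
    for episode in episodes:
        # Scan backwards so the return from each time step is built up
        # incrementally: G_t = r_t + gamma * G_{t+1}.
        G = 0.0
        for state, reward in reversed(episode):
            G = reward + gamma * G
            total_return[state] += G
            visits[state] += 1
    return {s: total_return[s] / visits[s] for s in visits}
```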
Temporal Difference (TD) Learning #
TD learning maintains an estimated value $V^\pi(s)$ for each state and updates it with an exponential moving average of the samples it observes:

$$V^\pi(s) \leftarrow (1-\alpha)\,V^\pi(s) + \alpha\,[R(s, \pi(s), s') + \gamma V^\pi(s')]$$

- $\alpha \in [0, 1]$ is a learning rate: a higher $\alpha$ should be used when the current estimate is bad, and a lower $\alpha$ when the current estimate is good.
- $\pi(s)$ is the action sampled from the current policy. This action leads to a new state $s'$, whose estimated value is used to calculate the new value of the current state.
- $V^\pi(s)$ will only converge to the optimal value if the policy used is optimal. TD learning by itself cannot learn this optimal policy.
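A minimal sketch of a single TD update, assuming the value estimates are kept in a plain dict `V` mapping states to numbers (the representation and function name are illustrative):

```python
def td_update(V, state, reward, next_state, alpha=0.1, gamma=0.9):
    """One temporal-difference update of V(state) after the policy's
    action produced `reward` and led to `next_state`."""
    sample = reward + gamma * V.get(next_state, 0.0)
    V[state] = (1 - alpha) * V.get(state, 0.0) + alpha * sample
```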
Q-Learning #
Q-learning is like TD learning, but using Q-values instead of normal utility values.
Regardless of the optimality of the policy used or actions taken, Q-learning will still converge to the optimal policy given a reasonable learning rate and enough exploration.
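For reference, the tabular Q-learning update is $Q(s,a) \leftarrow (1-\alpha)\,Q(s,a) + \alpha\,[R(s,a,s') + \gamma \max_{a'} Q(s',a')]$. A minimal sketch, assuming Q is stored as a dict keyed by `(state, action)` pairs and `next_actions` lists the actions available in the next state (both conventions are assumptions for illustration):

```python
def q_update(Q, state, action, reward, next_state, next_actions,
             alpha=0.1, gamma=0.9):
    """One tabular Q-learning update after observing the transition
    (state, action, reward, next_state)."""
    # Off-policy target: the best Q-value available in the next state.
    best_next = max((Q.get((next_state, a), 0.0) for a in next_actions),
                    default=0.0)
    sample = reward + gamma * best_next
    Q[(state, action)] = ((1 - alpha) * Q.get((state, action), 0.0)
                          + alpha * sample)
```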
Approximate Q-learning improves Q-learning by using a feature representation for states:
- Define a feature vector $f(s, a) = (f_1(s, a), \ldots, f_n(s, a))$ for each Q-state $(s, a)$
- Define $Q(s, a) = \sum_i w_i\, f_i(s, a)$, a weighted linear combination of the features
- Let $\text{difference} = [R(s, a, s') + \gamma \max_{a'} Q(s', a')] - Q(s, a)$
- Update weights using $w_i \leftarrow w_i + \alpha \cdot \text{difference} \cdot f_i(s, a)$
- Update Q-values using $Q(s, a) \leftarrow Q(s, a) + \alpha \cdot \text{difference}$ (this is exact Q-learning rewritten in terms of the difference)
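A minimal sketch of the approximate Q-learning weight update, assuming a hypothetical `features(state, action)` helper that returns a dict of feature values for a Q-state:

```python
def approx_q_value(weights, feats):
    """Linear Q-value: dot product of weights and feature values."""
    return sum(weights.get(name, 0.0) * value for name, value in feats.items())

def approx_q_update(weights, features, state, action, reward, next_state,
                    next_actions, alpha=0.05, gamma=0.9):
    """One approximate Q-learning update of the feature weights."""
    feats = features(state, action)
    best_next = max((approx_q_value(weights, features(next_state, a))
                     for a in next_actions), default=0.0)
    difference = (reward + gamma * best_next) - approx_q_value(weights, feats)
    # Each weight moves in proportion to its feature's contribution.
    for name, value in feats.items():
        weights[name] = weights.get(name, 0.0) + alpha * difference * value
```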
SARSA variation: use the action actually performed by the current policy (rather than the maximizing action) to learn the Q-value.
- SARSA only converges to the optimal Q-values if the policy becomes greedy over time (e.g., $\epsilon \to 0$ in $\epsilon$-greedy), and if trajectories include all state-action pairs
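A minimal sketch of the on-policy SARSA update, whose target uses $Q(s', a')$ for the action the policy actually takes next instead of $\max_{a'} Q(s', a')$; the dict representation and names are illustrative:

```python
def sarsa_update(Q, state, action, reward, next_state, next_action,
                 alpha=0.1, gamma=0.9):
    """One SARSA update for the observed (s, a, r, s', a') tuple."""
    # On-policy target: value of the action the policy actually took in s'.
    sample = reward + gamma * Q.get((next_state, next_action), 0.0)
    Q[(state, action)] = ((1 - alpha) * Q.get((state, action), 0.0)
                          + alpha * sample)
```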
Exploration vs Exploitation #
The following methods distribute time between exploration and exploitation:
Epsilon-Greedy #
Define some probability $\epsilon \in [0, 1]$. With probability $\epsilon$, act randomly (explore). With probability $1 - \epsilon$, choose the next action based on the current best policy (exploit).
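A minimal sketch of $\epsilon$-greedy action selection over a tabular Q dict (names are illustrative):

```python
import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """With probability epsilon pick a random action (explore);
    otherwise pick the action with the highest current Q-value (exploit)."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((state, a), 0.0))
```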
Exploration Function #
During the Q-learning update, maximize over some exploration function $f$ instead of the raw Q-values:

$$Q(s, a) \leftarrow (1-\alpha)\,Q(s, a) + \alpha\,[R(s, a, s') + \gamma \max_{a'} f(s', a')]$$

where $f(s, a) = Q(s, a) + \dfrac{k}{N(s, a)}$, and
- $k$ is a constant determining the degree of exploration. Decrease $k$ towards 0 over time to decrease the amount of exploration being done.
- $N(s, a)$ is the number of times the Q-state $(s, a)$ has been visited.
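A minimal sketch of a Q-learning update that maximizes over this exploration function, with visit counts tracked in a dict `N`; the bookkeeping and the `+ 1` in the denominator (which avoids division by zero for unvisited Q-states) are assumptions made for illustration:

```python
def explore_q_update(Q, N, state, action, reward, next_state, next_actions,
                     alpha=0.1, gamma=0.9, k=1.0):
    """Q-learning update whose target maximizes an exploration function
    f(s', a') = Q(s', a') + k / N(s', a'), so rarely tried actions
    receive an optimistic bonus."""
    def f(s, a):
        # Exploration bonus shrinks as the Q-state is visited more often.
        return Q.get((s, a), 0.0) + k / (N.get((s, a), 0) + 1)

    N[(state, action)] = N.get((state, action), 0) + 1
    best_next = max((f(next_state, a) for a in next_actions), default=0.0)
    sample = reward + gamma * best_next
    Q[(state, action)] = ((1 - alpha) * Q.get((state, action), 0.0)
                          + alpha * sample)
```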