πŸ—’οΈ Ben's Notes

Markov Decision Processes

What is a Markov Decision Process? #

A Markov Decision Process (MDP) is a Markov model used to formulate nondeterministic search problems (where an action can result in multiple possible successor states).

An MDP is defined by:

  • A set of states $s \in S$
  • A set of actions $a \in A$
  • A transition model $T(s, a, s')$ that represents the probability $P(s' \mid s, a)$ that the action $a$ taken at state $s$ will lead to a new state $s'$ (allowed by the memoryless property)
  • A reward function $R(s, a, s')$ per transition
  • A discount factor $\gamma \in [0, 1]$
  • A start state
  • A terminal (absorbing) state
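
As a concrete example, one way to write down a small MDP is as plain Python dictionaries. This is a minimal sketch; the states, actions, rewards, and transition probabilities below are made-up values for illustration, not part of any particular problem:

```python
# A hypothetical two-state MDP: 'cool' and 'warm' are regular states,
# 'overheated' is a terminal (absorbing) state with no actions.
GAMMA = 0.9  # discount factor

# T[(s, a)] is a list of (s', P(s' | s, a)) pairs
T = {
    ("cool", "slow"): [("cool", 1.0)],
    ("cool", "fast"): [("cool", 0.5), ("warm", 0.5)],
    ("warm", "slow"): [("cool", 0.5), ("warm", 0.5)],
    ("warm", "fast"): [("overheated", 1.0)],
}

# R[(s, a, s')] is the reward received for that transition
R = {
    ("cool", "slow", "cool"): 1.0,
    ("cool", "fast", "cool"): 2.0,
    ("cool", "fast", "warm"): 2.0,
    ("warm", "slow", "cool"): 1.0,
    ("warm", "slow", "warm"): 1.0,
    ("warm", "fast", "overheated"): -10.0,
}

STATES = ["cool", "warm", "overheated"]
ACTIONS = {"cool": ["slow", "fast"], "warm": ["slow", "fast"], "overheated": []}
```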

The utility function of an MDP can be calculated as follows:

$$ U([s_0, a_0, s_1, a_1, s_2, \cdots]) = R(s_0, a_0, s_1) + \gamma R(s_1, a_1, s_2) + \gamma^2 R(s_2, a_2, s_3) + \cdots $$

Intuitively, the discount factor causes rewards to decay exponentially over time, so if an agent takes too long to reach rewarding states, those rewards are worth almost nothing. This also solves the problem of infinite reward streams (it effectively enforces a finite horizon).

  • The total discounted utility is bounded by $\frac{R_{\max}}{1-\gamma}$ (a geometric series with ratio $\gamma$).
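
As a quick sanity check, here is a sketch of computing a discounted utility and the geometric bound; the reward sequence and $R_{\max}$ value are arbitrary assumptions for illustration:

```python
# Discounted utility of a finite reward sequence, and the geometric bound.
rewards = [2.0, 2.0, 1.0, -10.0]
gamma = 0.9

utility = sum(gamma**t * r for t, r in enumerate(rewards))
print(utility)  # 2 + 0.9*2 + 0.81*1 - 0.729*10 β‰ˆ -2.68

# If every reward is at most R_max, the infinite discounted sum cannot
# exceed R_max / (1 - gamma):
r_max = 2.0
bound = r_max / (1 - gamma)
print(bound)  # 20.0
```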

The primary goal for an MDP is to find an optimal policy $\pi^*$ that gives us an action for every state that results in the maximum expected utility.

  • The expected utility of a policy $U^{\pi}(s_0)$ is the sum, over all possible state sequences, of each sequence's utility weighted by the probability of that sequence occurring (see the enumeration sketch below).
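
The following sketch makes the "sum over state sequences" view concrete by enumerating every trajectory up to a fixed horizon under a given policy. It reuses the hypothetical `T`, `R`, `GAMMA`, and `ACTIONS` dictionaries from the sketch above; the policy chosen here is arbitrary:

```python
def trajectories(s, policy, horizon):
    """Yield (probability, [r_0, r_1, ...]) for every trajectory of length <= horizon."""
    if horizon == 0 or not ACTIONS[s]:      # horizon reached or terminal state
        yield 1.0, []
        return
    a = policy[s]
    for s_next, prob in T[(s, a)]:
        r = R[(s, a, s_next)]
        for p_rest, rs in trajectories(s_next, policy, horizon - 1):
            yield prob * p_rest, [r] + rs

policy = {"cool": "fast", "warm": "slow"}

# Expected utility = sum over sequences of (probability * discounted utility)
expected = sum(
    p * sum(GAMMA**t * r for t, r in enumerate(rs))
    for p, rs in trajectories("cool", policy, horizon=3)
)
print(expected)
```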

Solving MDPs #

The Bellman Equation #

Some values:

  • The utility of a state $s$, written $U^*(s)$, is equal to the expected utility when starting at $s$ and acting according to the optimal policy.
  • The utility of a Q-state $Q^*(s, a)$ is equal to the expected utility of taking action $a$ in state $s$, then acting optimally.

The Bellman Equation is as follows:

$$ U^*(s) = \max_a \sum_{s'} T(s, a, s')\times(R(s, a, s') + \gamma U^*(s')) = \max_a Q^*(s, a) $$

  • The equation finds the optimal value of the state $s$ by taking, for each action, the sum over successor states of the transition probability times (the reward for that transition plus the discounted utility of the successor state), then maximizing over actions.
  • The inner sum term is equivalent to the utility of the Q-state.
  • The equation creates a dynamic programming problem where the subproblem $U^*(s')$ is used to calculate the current state's utility.
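
A small helper makes this structure concrete. This is a sketch that reuses the hypothetical `T`, `R`, `GAMMA`, and `ACTIONS` dictionaries from above; `U` can be any table of state utilities:

```python
# Q-value of (s, a) under a utility table U: the inner sum of the Bellman equation.
def q_value(s, a, U, gamma=GAMMA):
    return sum(prob * (R[(s, a, s_next)] + gamma * U[s_next])
               for s_next, prob in T[(s, a)])

# One Bellman backup for state s: the max over actions of the Q-values.
def bellman_backup(s, U, gamma=GAMMA):
    if not ACTIONS[s]:        # terminal state: utility stays 0
        return 0.0
    return max(q_value(s, a, U, gamma) for a in ACTIONS[s])
```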

Value Iteration #

Used for computing the optimal values of states by applying iterative updates until convergence.

In order to compute the value of a state, we can use the following algorithm (a code sketch follows the list):

  1. For all states $s \in S$, initialize their starting value $U_0(s) = 0$.
  2. Until convergence (i.e. $U_{k+1} = U_k$), run the following update formula:
    1. $\forall s \in S,\ U_{k+1}(s) = \max_a \sum_{s'} T(s, a, s')\times (R(s, a, s') + \gamma U_k(s'))$
    2. Unlike the Bellman equation, which tests for optimality, this equation updates the values iteratively using dynamic programming.
    3. In other words, for each action this equation takes the expectation over successor states (each successor's value weighted by its transition probability), then takes the max over actions.
    4. $U^*(s)$ for any terminal state must be $0$ (since no actions can be taken from these states).
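
Here is a sketch of value iteration on the hypothetical MDP above, reusing `STATES` and the `bellman_backup` helper defined earlier; the tolerance parameter is an assumption for numerical convergence:

```python
def value_iteration(tol=1e-6):
    U = {s: 0.0 for s in STATES}                     # step 1: U_0(s) = 0
    while True:
        # step 2: apply the Bellman update to every state
        U_next = {s: bellman_backup(s, U) for s in STATES}
        # stop once the values stop changing (numerical convergence)
        if max(abs(U_next[s] - U[s]) for s in STATES) < tol:
            return U_next
        U = U_next

U_star = value_iteration()
print(U_star)
```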

Properties:

  • Value iteration is guaranteed to converge for discount factors $\gamma < 1$.
  • Value iteration will converge to the same $U^*$ values regardless of the initial values.
  • The runtime of each iteration of value iteration is $O(|S|^2 |A|)$, since for every state we must compute each action's Q-value, which requires iterating over all successor states.

Q-value iteration is a similar update algorithm that computes Q-values instead:

$$ Q_{k+1}(s, a) = \sum_{s'} T(s, a, s') \times (R(s, a, s') + \gamma \max_{a'} Q_k(s', a')) $$

  • This is different because in value iteration, we select an action (the max) before taking the expectation over transitions, whereas in Q-value iteration, the action is already fixed and we make the transition before choosing the next action $a'$.
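
A sketch of Q-value iteration on the same hypothetical MDP, reusing `T`, `R`, `GAMMA`, `STATES`, and `ACTIONS` from above:

```python
def q_value_iteration(tol=1e-6):
    Q = {(s, a): 0.0 for s in STATES for a in ACTIONS[s]}
    while True:
        Q_next = {}
        for (s, a) in Q:
            Q_next[(s, a)] = sum(
                prob * (R[(s, a, s_next)]
                        + GAMMA * max((Q[(s_next, a2)] for a2 in ACTIONS[s_next]),
                                      default=0.0))   # terminal: no actions, value 0
                for s_next, prob in T[(s, a)])
        if max(abs(Q_next[k] - Q[k]) for k in Q) < tol:
            return Q_next
        Q = Q_next

Q_star = q_value_iteration()
print(Q_star)
```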

Policy extraction: to determine the optimal policy from optimal state values (given a state value function), we can use the following equation:

$$ \forall s \in S,\ \pi^*(s) = \argmax_a Q^*(s, a) = \argmax_a \sum_{s'} T(s, a, s') [R(s, a, s') + \gamma U^*(s')] $$

Policy extraction is only optimal if the state value function is optimal. Used either after value iteration (to compute optimal policies from optimal state values) or as a subroutine in policy iteration (to compute the best policy for estimated state values).
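
A sketch of policy extraction, reusing `STATES`, `ACTIONS`, the `q_value` helper, and the `U_star` table computed by value iteration above:

```python
def extract_policy(U):
    # For each non-terminal state, pick the action with the highest Q-value.
    return {s: max(ACTIONS[s], key=lambda a: q_value(s, a, U))
            for s in STATES if ACTIONS[s]}

pi_star = extract_policy(U_star)
print(pi_star)
```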

Policy Iteration #

Policy iteration has the same optimality guarantees as value iteration, but often converges much faster in practice. The algorithm is as follows (a code sketch follows the list):

  1. Define an initial policy. The closer the initial policy is to the optimal policy, the faster it will converge.
  2. Compute $U^{\pi}(s) = \sum_{s'} T(s, \pi(s), s') [R(s, \pi(s), s') + \gamma U^{\pi}(s')]$ for all states $s$ (the expected utility of starting in state $s$ when following the current policy $\pi$).
  3. Generate the next iteration $i$ of policy values using the equation above for every state.
  4. Use policy improvement to generate a better policy: $\pi_{i+1}(s) = \argmax_a \sum_{s'} T(s, a, s')[R(s, a, s') + \gamma U^{\pi_i}(s')]$
  5. Repeat steps 2-4 until $\pi_{i+1} = \pi_i = \pi^*$.
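
A sketch of policy iteration on the hypothetical MDP above, reusing `STATES`, `ACTIONS`, and the `q_value` helper from earlier; the initial policy and evaluation tolerance are arbitrary assumptions:

```python
def policy_evaluation(pi, tol=1e-6):
    # Iteratively compute U^pi(s) = sum_{s'} T(s, pi(s), s') [R + gamma * U^pi(s')]
    U = {s: 0.0 for s in STATES}
    while True:
        U_next = {s: (q_value(s, pi[s], U) if ACTIONS[s] else 0.0) for s in STATES}
        if max(abs(U_next[s] - U[s]) for s in STATES) < tol:
            return U_next
        U = U_next

def policy_iteration():
    # step 1: arbitrary initial policy over non-terminal states
    pi = {s: ACTIONS[s][0] for s in STATES if ACTIONS[s]}
    while True:
        U_pi = policy_evaluation(pi)                 # steps 2-3: evaluate current policy
        # step 4: policy improvement
        pi_next = {s: max(ACTIONS[s], key=lambda a: q_value(s, a, U_pi))
                   for s in pi}
        if pi_next == pi:                            # step 5: policy no longer changes
            return pi
        pi = pi_next

print(policy_iteration())
```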