
Decision Theory

So far, in binary decision making and hypothesis testing, we've explored how to make as few mistakes as possible when making binary predictions.

Intro to Decision Theory

We can generalize these to the following decision problem:

  • Suppose there is some unknown quantity of interest $\theta$.
    • $\theta$ is random under a Bayesian approach, and fixed under a frequentist approach.
  • We collect/observe some data $X$.
  • There is some true distribution that the data is drawn from, $p(X \mid \theta)$.
  • We are tasked with creating a good estimator $\delta(x)$ (also written $\hat\theta$), which makes decisions based on the data.
  • To quantify how good or bad our estimator is, we can use a loss function $l(\delta(x), \theta)$, where higher loss values are worse, as sketched below.
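
Here is a minimal sketch of this setup in Python, assuming a made-up coin-flip model: $\theta$ is the coin's probability of heads, the data $X$ is $n$ flips, and $\delta(x)$ is the sample mean (all of these specifics are illustrative assumptions).

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed setup for illustration: theta is a coin's probability of heads,
# the data X is n flips of that coin, the estimator is the sample mean.
theta = 0.3                         # unknown quantity of interest
n = 50
x = rng.binomial(1, theta, size=n)  # observed data X ~ p(X | theta)

def delta(x):
    """Estimator delta(x) (a.k.a. theta-hat): a decision based on the data."""
    return x.mean()

print(delta(x))  # our estimate of theta
```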

Loss Functions

0/1 Loss

In the case of binary decisions:

  • $\theta$ is either $0$ or $1$ and represents reality.
  • $\delta(x)$ is also either $0$ or $1$ and represents our decision.
  • A very simple loss function returns $0$ if the decision matches reality and $1$ otherwise. This is known as 0/1 loss, sketched below.
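
As a tiny sketch, 0/1 loss translates directly into code:

```python
def zero_one_loss(decision, theta):
    """0/1 loss: 0 if the decision matches reality, 1 otherwise."""
    return 0 if decision == theta else 1

print(zero_one_loss(1, 1))  # 0 -- decision matches reality
print(zero_one_loss(0, 1))  # 1 -- decision is wrong
```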

L2 Loss

If the quantity we are estimating is continuous, 0/1 loss is no longer useful: our decision will almost never match $\theta$ exactly. Instead, we can define the loss as follows: $$l(\delta(x), \theta) = (\delta(x) - \theta)^2$$ This is known as $L_2$ (squared error) loss.
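
A correspondingly small sketch of $L_2$ loss; note how a miss twice as large costs four times as much:

```python
def l2_loss(decision, theta):
    """L2 (squared error) loss: penalizes larger misses quadratically."""
    return (decision - theta) ** 2

print(l2_loss(0.4, 0.3))  # miss by 0.1 -> loss ~0.01
print(l2_loss(0.5, 0.3))  # miss by 0.2 -> loss ~0.04
```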

Applying Loss Functions

Frequentist Risk

Under frequentist assumptions, $\theta$ is fixed and the data is random, so the risk of an estimator is the expected loss over the data: $$R(\theta) = E_{x|\theta}[l(\delta(x), \theta)] = \int l(\delta(x), \theta)\, p(x \mid \theta)\, dx$$ In other words, average the loss over every possible dataset, weighting each one by how likely it is under $p(x \mid \theta)$.
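
A Monte Carlo sketch of frequentist risk under $L_2$ loss, reusing the assumed coin-flip model and sample-mean estimator (the helper name `frequentist_risk` is mine):

```python
import numpy as np

rng = np.random.default_rng(0)

def frequentist_risk(theta, n=50, trials=100_000):
    """Approximate R(theta): redraw datasets x ~ p(x | theta) many times
    and average the L2 loss of the sample-mean estimator."""
    x = rng.binomial(1, theta, size=(trials, n))
    delta = x.mean(axis=1)                # estimator applied to each dataset
    return np.mean((delta - theta) ** 2)  # average loss over datasets

# The sample mean is unbiased, so its risk equals its variance:
print(frequentist_risk(0.3))  # ~0.0042
print(0.3 * 0.7 / 50)         # theta*(1-theta)/n = 0.0042
```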

Bayesian Posterior Risk

Syntactically, Bayesian posterior risk looks very similar to frequentist risk: $$\rho(\delta \mid x) = E_{\theta|x}[l(\delta(x), \theta)] = \int l(\delta(x), \theta)\, p(\theta \mid x)\, d\theta$$ The main difference is what is held fixed: rather than averaging over every possible dataset for a fixed $\theta$, we hold the observed data fixed and average over every possible value of $\theta$, weighted by its posterior probability given that data.
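
A sketch of posterior risk under the same $L_2$ loss, assuming a conjugate Beta(1, 1) prior so that $\theta \mid x \sim \mathrm{Beta}(1 + \text{heads},\, 1 + \text{tails})$; here the data is held fixed and $\theta$ is resampled from the posterior:

```python
import numpy as np

rng = np.random.default_rng(0)

def posterior_risk(x, decision, draws=100_000):
    """Approximate the posterior risk of a decision: draw theta from the
    conjugate Beta posterior (Beta(1, 1) prior) and average the L2 loss."""
    heads = x.sum()
    tails = len(x) - heads
    theta = rng.beta(1 + heads, 1 + tails, size=draws)  # draws of theta | x
    return np.mean((decision - theta) ** 2)

x = rng.binomial(1, 0.3, size=50)         # one fixed, observed dataset
post_mean = (1 + x.sum()) / (2 + len(x))  # posterior mean of theta
print(posterior_risk(x, post_mean))       # minimal: under L2 loss the
print(posterior_risk(x, x.mean()))        # posterior mean is the best decision
```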

Bayes Risk

$$E_{\theta, x}[l(\delta(x), \theta)] = E_\theta\big[E_{x|\theta}[l(\delta(x), \theta)]\big] = E_x\big[E_{\theta|x}[l(\delta(x), \theta)]\big]$$ Bayes risk is the joint expectation of the loss function over both the parameter and the data. By the law of iterated expectation, it can be computed either by averaging the frequentist risk over the prior on $\theta$, or by averaging the Bayesian posterior risk over the marginal distribution of the data, as checked below.
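
A quick numerical check of the two iterated expectations, again assuming the uniform-prior coin model: sampling $(\theta, x)$ jointly and averaging the loss should match averaging the frequentist risk over the prior.

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 50, 200_000

# Sample (theta, x) jointly: theta from the Beta(1, 1) prior, then
# x = number of heads in n flips of a theta-coin.
theta = rng.beta(1, 1, size=trials)
x = rng.binomial(n, theta)
delta = x / n  # sample-mean estimator

# Joint expectation of the loss -- the Bayes risk:
print(np.mean((delta - theta) ** 2))  # ~0.0033

# Frequentist side: E_theta[R(theta)] with R(theta) = theta*(1-theta)/n
# and E[theta*(1-theta)] = 1/6 under the uniform prior:
print((1 / 6) / n)                    # 0.00333...
```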

Bias-Variance Decomposition

Below is the calculation for the frequentist risk of the $L_2$ loss function, writing $\bar{\delta} = E_{x|\theta}[\delta(x)]$ for the average decision:

$$
\begin{align}
R(\theta) &= E_{x|\theta}\Big[\big( \delta(x) - E_{x|\theta}[\delta(x)] + E_{x|\theta}[\delta(x)] - \theta \big)^2\Big] \\
&= E_{x|\theta}\Big[\big( \delta - \bar{\delta} + \bar{\delta} - \theta \big)^2\Big] \\
&= E_{x|\theta}\Big[ \big(\delta - \bar{\delta}\big)^2 + \underbrace{2\big(\delta - \bar{\delta}\big)\big(\bar{\delta} - \theta\big)}_{=\,0\text{ in expectation}} + \big(\bar{\delta} - \theta\big)^2 \Big] \\
&= E_{x|\theta}\Big[\big(\delta - \bar{\delta}\big)^2\Big] + E_{x|\theta}\Big[\big(\bar{\delta} - \theta\big)^2\Big] \\
&= \underbrace{E_{x|\theta}\Big[\big(\delta - \bar{\delta}\big)^2\Big]}_{\text{variance of }\delta(x)} + \big(\underbrace{\bar{\delta} - \theta}_{\text{bias of }\delta(x)}\big)^2
\end{align}
$$

The cross term drops out because $\bar{\delta} - \theta$ is constant with respect to $x$ and $E_{x|\theta}[\delta - \bar{\delta}] = 0$. The variance of the estimator measures how spread out its decisions are compared to their average. Larger variance means that the model is more sensitive to randomness in the data.

The bias of the estimator measures how far the average of the estimator is from the true value of $\theta$. In other words, if we average out all of the randomness in the data, how close is our model to reality?

If the estimator were perfect, the expectation of $\delta - \theta$ would be $0$, which would make the bias $0$.
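
A sketch verifying the decomposition numerically; the shrinkage estimator $(\text{heads} + 1)/(n + 2)$ is chosen only so that both the bias and variance terms are visibly nonzero:

```python
import numpy as np

rng = np.random.default_rng(0)
theta, n, trials = 0.3, 50, 200_000

# Many datasets from the same theta, each fed to a deliberately
# biased shrinkage estimator delta(x) = (heads + 1) / (n + 2).
x = rng.binomial(1, theta, size=(trials, n))
delta = (x.sum(axis=1) + 1) / (n + 2)

risk = np.mean((delta - theta) ** 2)  # E[(delta - theta)^2]
variance = delta.var()                # E[(delta - delta_bar)^2]
bias = delta.mean() - theta           # delta_bar - theta

print(risk)                  # the two printed values should agree:
print(variance + bias ** 2)  # risk = variance + bias^2
```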