Parameter Estimation

Suppose we observe $n$ data points ($x_1$ to $x_n$). Let $\theta$ be some unknown parameter that describes the distribution the data points were sampled from.

As an example, let’s say we are trying to quantify how good a product is based on how many positive and negative reviews it has.

This means that each data point $x_i$ is either $0$ (bad review) or $1$ (good review).
Let $p(x_i | \theta) = \theta^{x_i} (1-\theta)^{1-x_i}$. This means that reviews are positive with probability $\theta$, and negative with probability $1-\theta$.
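
As a minimal sketch of this model, the per-review probability can be written directly from the formula above. The function name `bernoulli_pmf` and the review list are made up for illustration.

```python
def bernoulli_pmf(x, theta):
    """p(x | theta) = theta^x * (1 - theta)^(1 - x) for x in {0, 1}."""
    if x not in (0, 1):
        raise ValueError("x must be 0 (bad review) or 1 (good review)")
    return theta ** x * (1 - theta) ** (1 - x)


# Hypothetical reviews: 1 = good, 0 = bad (made-up data).
reviews = [1, 1, 0, 1]
theta = 0.75

# Probability of each individual review under this value of theta.
probs = [bernoulli_pmf(x, theta) for x in reviews]
```

Each good review contributes a factor of $\theta$ and each bad review a factor of $1-\theta$.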

Maximum Likelihood Estimation (MLE)

http://prob140.org/textbook/content/Chapter_20/01_Maximum_Likelihood.html

MLE also shows up in the CS 188 ML pipeline for fitting CPTs from data — see Machine Learning (Naive Bayes section). For the broader probability background, see probability-overview.

In a frequentist approach, $\theta$ is a fixed but unknown value, so our goal is to find the best estimate of $\theta$ from the observed data.

Recall that likelihood is the probability of the data given the parameter,

$$p(x_i | \theta)$$

Goal: Given an iid sample of $N$ points $x_1, \cdots, x_N$ and a distribution described by a parameter $\theta$, find the value of $\theta$ under which this sample is most probable (i.e. the value that maximizes the likelihood).

Formal definition: find $\theta$ that maximizes $L(\theta) = \prod_{i=1}^N P_\theta(x_i)$, where $P_\theta(x_i)$ is the probability of one data point $x_i$ occurring given a value of $\theta$.
$MLE(\theta | X=x) = \arg\max_\theta P(X=x|\theta) = \arg\max_\theta \ln P(X=x | \theta)$
For a differentiable likelihood whose maximum lies in the interior of the parameter space, this occurs where $\frac{\partial}{\partial \theta} L(\theta) = 0$.
Differentiating a product is painful, so we can monotonically transform the likelihood function using $\log$. Since $\log$ is strictly increasing, $\arg\max_\theta L(\theta) = \arg\max_\theta \log L(\theta)$, so we can instead maximize the log-likelihood $\log L(\theta)$, which turns the product into a sum.
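
As a worked example, assume the Bernoulli review model from the introduction and write $k = \sum_{i=1}^N x_i$ for the number of positive reviews (notation introduced here for convenience). The log-likelihood is

$$\log L(\theta) = \sum_{i=1}^N \left[ x_i \log\theta + (1-x_i)\log(1-\theta) \right] = k\log\theta + (N-k)\log(1-\theta)$$

Setting the derivative to zero gives

$$\frac{\partial}{\partial\theta}\log L(\theta) = \frac{k}{\theta} - \frac{N-k}{1-\theta} = 0 \implies \hat\theta_{MLE} = \frac{k}{N}$$

so the estimate is just the fraction of positive reviews. (When $k = 0$ or $k = N$ the maximum sits on the boundary, at $\theta = 0$ or $\theta = 1$, and the formula $k/N$ still gives the right answer.)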

Using MLE to predict CPT values given data points:

$P(Y=y) = MLE(\theta | (X,Y))$ = (# data points with $Y=y$) / (# data points total)
$P(X=x|Y=y) = MLE(\theta | (X,Y))$ = (# data points where ($X=x, Y=y$) ) / (# data points where $Y=y$)
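
The two counting formulas above can be sketched in a few lines. This is an illustrative sketch, not the course's reference implementation: the function name `estimate_cpts` and the tiny dataset are made up.

```python
from collections import Counter


def estimate_cpts(pairs):
    """MLE estimates of P(Y=y) and P(X=x | Y=y) from (x, y) pairs, by counting."""
    pairs = list(pairs)
    total = len(pairs)
    y_counts = Counter(y for _, y in pairs)   # number of points with Y = y
    xy_counts = Counter(pairs)                # number of points with (X = x, Y = y)

    prior = {y: c / total for y, c in y_counts.items()}
    conditional = {(x, y): c / y_counts[y] for (x, y), c in xy_counts.items()}
    return prior, conditional


# Made-up data: x is a word feature, y is a class label.
data = [("free", "spam"), ("free", "spam"), ("hello", "spam"), ("hello", "ham")]
prior, conditional = estimate_cpts(data)
```

Pairs that never appear in the data get no entry in `conditional`, which corresponds to an estimated probability of zero.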

Bayesian Parameter Estimation

Now, $\theta$ is random. We then need to specify a prior $p(\theta)$ that represents what we believe the distribution for $\theta$ might look like.

In a Bayesian approach, we calculate the posterior distribution $p(\theta|x)$, which is an update to the prior now that we have observed some data.

Through Bayes’ Rule, $p(\theta|x) = \frac{p(x|\theta)p(\theta)}{p(x)}$.

$p(x|\theta)$ is the likelihood.
$p(\theta)$ is the prior.
$p(x) = \int p(x|\theta)p(\theta)d\theta$ is the marginal likelihood (evidence). This integral is often intractable, especially when $\theta$ is high-dimensional.

Rather than needing to compute $p(x)$, we can simply state that $p(\theta|x)$ is proportional to the likelihood times the prior, $p(x|\theta)p(\theta)$.
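
One way to see what "proportional to" buys us: for a one-dimensional $\theta$, the unnormalized product can be evaluated on a grid and normalized numerically. The sketch below assumes the Bernoulli review model from the introduction; `grid_posterior`, the grid size, and the data are illustrative choices, and the Riemann sum is only an approximation of $p(x)$.

```python
def grid_posterior(data, prior_fn, num_points=1001):
    """Approximate p(theta | x) on an evenly spaced grid over [0, 1]."""
    k = sum(data)          # number of positive observations
    n = len(data)
    step = 1 / (num_points - 1)
    grid = [i / (num_points - 1) for i in range(num_points)]

    # Likelihood times prior, evaluated pointwise (not yet normalized).
    unnorm = [(t ** k) * ((1 - t) ** (n - k)) * prior_fn(t) for t in grid]

    # Riemann-sum approximation of the normalizing constant p(x).
    z = sum(unnorm) * step
    return grid, [u / z for u in unnorm]


# Made-up data with a flat prior on [0, 1].
grid, post = grid_posterior([1, 1, 0, 1], prior_fn=lambda t: 1.0)
best = max(range(len(grid)), key=lambda i: post[i])
```

The location of the peak (`grid[best]`) does not depend on `z` at all, which is why the normalizing constant can be ignored when only the shape of the posterior matters.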

A convenient choice for the prior is the Beta distribution.

$$\theta \sim Beta(\alpha,\beta): \quad p(\theta) \propto \theta^{\alpha - 1}(1-\theta)^{\beta - 1}$$

The Beta(1,1) distribution is the Uniform distribution on $[0, 1]$.

Given that the data $x_i$ are sampled from a Bernoulli($\theta$) distribution, which has likelihood function $p(x|\theta) = \theta^{k} (1-\theta)^{n - k}$ (where $k$ is the number of positive values out of $n$ total values), and a prior of Beta($\alpha, \beta$), we can compute the posterior as follows:

$$p(\theta|x) \propto p(x|\theta)p(\theta) \propto (\theta^{k} (1-\theta)^{n - k}) \cdot (\theta^{\alpha - 1}(1 - \theta)^{\beta - 1})$$

$$\propto \theta^{k+\alpha-1} (1-\theta)^{n-k+\beta - 1}$$

which is a $Beta(k + \alpha, n-k+\beta)$ distribution.
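
In code, this update amounts to adding counts to the prior parameters. A minimal sketch (the function name `beta_bernoulli_update` and the example data are hypothetical):

```python
def beta_bernoulli_update(alpha, beta, data):
    """Return the posterior Beta parameters after observing 0/1 data.

    Prior Beta(alpha, beta) with k successes out of n observations
    gives posterior Beta(alpha + k, beta + n - k).
    """
    k = sum(data)
    n = len(data)
    return alpha + k, beta + (n - k)


# Uniform prior Beta(1, 1) and four made-up reviews (three good, one bad).
posterior = beta_bernoulli_update(1, 1, [1, 1, 0, 1])
```

Because the posterior is again a Beta distribution, it can serve as the prior for the next batch of reviews.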

Maximum A Posteriori (MAP)

Point estimators, one of which is the MAP method, reduce a posterior distribution to a single number.

We can calculate it as the argmax of the posterior with respect to $\theta$, i.e. the value of $\theta$ that gives us the largest value for the posterior.

The MAP estimate is the mode of the posterior distribution.
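
For instance, with the $Beta(k+\alpha, n-k+\beta)$ posterior from the Bernoulli example, the log-posterior is, up to an additive constant, $(k+\alpha-1)\log\theta + (n-k+\beta-1)\log(1-\theta)$. Setting its derivative with respect to $\theta$ to zero gives

$$\hat\theta_{MAP} = \frac{k+\alpha-1}{n+\alpha+\beta-2}$$

which holds when both posterior parameters exceed $1$, so that the mode lies in the interior of $[0,1]$. With the uniform prior $Beta(1,1)$ this reduces to $k/n$, the same value as the MLE.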

Minimum Mean-Squared Error (MMSE)

MMSE is another point estimator. It finds the estimate $\hat\theta$ that gives the smallest mean squared error under the posterior:

$$\hat\theta_{MMSE} = \arg\min_{\hat\theta} E_{\theta|x} \left[ (\hat\theta - \theta)^2 \right]$$

The MMSE estimate is the mean of the posterior distribution, $E[\theta | x]$.
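
A small sketch of why the mean is the minimizer, assuming a Beta(a, b) posterior as in the Bernoulli example: for any candidate estimate $c$, $E[(c-\theta)^2] = (c - E[\theta])^2 + Var(\theta)$, which is smallest at $c = E[\theta]$. The function names are hypothetical; the mean and variance formulas are the standard ones for the Beta distribution.

```python
def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)


def beta_variance(a, b):
    """Variance of a Beta(a, b) distribution."""
    return a * b / ((a + b) ** 2 * (a + b + 1))


def expected_squared_error(estimate, a, b):
    """E[(estimate - theta)^2] for theta ~ Beta(a, b).

    Uses the identity E[(c - theta)^2] = (c - mean)^2 + variance.
    """
    return (estimate - beta_mean(a, b)) ** 2 + beta_variance(a, b)


# Example posterior Beta(4, 2) (made-up parameters).
a, b = 4, 2
mmse_estimate = beta_mean(a, b)        # posterior mean
map_estimate = (a - 1) / (a + b - 2)   # posterior mode, valid for a, b > 1
```

Here the two point estimates differ, and the posterior mean has the lower expected squared error of the two.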

Example: Estimating the Mean with Gaussian (Normal) Likelihood

Suppose we’re given that $n$ people have heights $x_1, \cdots, x_n$ distributed normally. We’re trying to find $\mu$, which is the population mean.

Then, the likelihood $p(x_i | \mu) = N(x_i; \mu, \sigma)$ describes the probability density of observing any one height given the mean $\mu$ (treating the standard deviation $\sigma$ as known).

Under a frequentist approach, we can calculate the MLE to be $\hat\mu_{MLE} = \frac{1}{n} \sum_{i=1}^n x_i$.
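
As a quick numerical check of this result, the sketch below evaluates the Gaussian log-likelihood at the sample mean and at nearby values. The heights and the value of `sigma` are made up, and `gaussian_log_likelihood` is a hypothetical helper name.

```python
import math


def gaussian_log_likelihood(xs, mu, sigma):
    """Log-likelihood of iid samples xs under N(mu, sigma^2)."""
    n = len(xs)
    return (-n * math.log(sigma * math.sqrt(2 * math.pi))
            - sum((x - mu) ** 2 for x in xs) / (2 * sigma ** 2))


heights = [160.0, 170.0, 180.0]   # made-up heights in cm
sigma = 10.0                      # assumed known standard deviation

# The MLE of the mean is the sample average.
mu_mle = sum(heights) / len(heights)

ll_at_mle = gaussian_log_likelihood(heights, mu_mle, sigma)
ll_below = gaussian_log_likelihood(heights, mu_mle - 1.0, sigma)
ll_above = gaussian_log_likelihood(heights, mu_mle + 1.0, sigma)
```

Moving `mu` away from the sample mean in either direction lowers the log-likelihood, whatever value of `sigma` is used.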

Under a Bayesian approach, we’re trying to find the posterior $p(\mu | x) \propto p(x | \mu) p(\mu)$.

If the likelihood and prior are both normal, then the posterior is also normal. This (along with the Beta prior with a Bernoulli likelihood above) is an example of a conjugate prior: a prior is conjugate to a likelihood when the resulting posterior belongs to the same family as the prior.
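
A minimal sketch of the normal-normal update, under assumptions not stated in the text above: the data standard deviation `sigma` is known, and the prior on $\mu$ is $N(\mu_0, \tau^2)$. Under those assumptions the posterior precision is $1/\tau^2 + n/\sigma^2$ and the posterior mean is a precision-weighted average of the prior mean and the data. The names `normal_posterior`, `mu0`, and `tau` are illustrative.

```python
def normal_posterior(xs, sigma, mu0, tau):
    """Posterior mean and standard deviation of mu.

    Assumes xs are iid N(mu, sigma^2) with sigma known,
    and a prior mu ~ N(mu0, tau^2).
    """
    n = len(xs)
    precision = 1 / tau ** 2 + n / sigma ** 2
    mean = (mu0 / tau ** 2 + sum(xs) / sigma ** 2) / precision
    return mean, (1 / precision) ** 0.5


heights = [160.0, 170.0, 180.0]   # made-up heights in cm
post_mean, post_sd = normal_posterior(heights, sigma=10.0, mu0=150.0, tau=10.0)
```

The posterior mean lands between the prior mean and the sample mean, and the posterior standard deviation is smaller than the prior's, reflecting the information gained from the data.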

ben's notes