
# Parameter Estimation

Suppose we observe $n$ data points ($x_1$ to $x_n$). Let $\theta$ be some unknown parameter that describes the distribution the data points were sampled from.

As an example, let’s say we are trying to quantify how good a product is based on how many positive and negative reviews it has.

  • This means that each data point $x_i$ is either $0$ (bad review) or $1$ (good review)
  • Let $p(x_i | \theta) = \theta^{x_i} (1-\theta)^{1-x_i}$. This means that reviews are positive with probability $\theta$, and negative with probability $1-\theta$.
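As a quick illustration, here is a minimal Python sketch of this likelihood for a list of reviews (the function and variable names are my own, not from the notes):

```python
def bernoulli_likelihood(reviews, theta):
    """Product of p(x_i | theta) = theta^x_i * (1 - theta)^(1 - x_i) over all reviews."""
    likelihood = 1.0
    for x in reviews:
        likelihood *= theta ** x * (1 - theta) ** (1 - x)
    return likelihood

# Hypothetical data: 1 = good review, 0 = bad review
reviews = [1, 1, 0, 1, 0, 1, 1]
print(bernoulli_likelihood(reviews, 0.7))
```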

## Maximum Likelihood Estimation (MLE)

http://prob140.org/textbook/content/Chapter_20/01_Maximum_Likelihood.html

In a frequentist approach, $\theta$ is fixed, so our goal is to find the best estimate for $\theta$ using the data given.

Recall that the likelihood is the probability of the data given the parameter, $p(x_i | \theta)$.

Goal: Given an i.i.d. sample of $N$ points $x_1, \cdots, x_N$ and a distribution described by a parameter $\theta$, what value of $\theta$ gives the highest probability of this set of points occurring under that distribution? (i.e. we want to maximize the likelihood)

  • Formal definition: find the $\theta$ that maximizes $L(\theta) = \prod_{i=1}^N P_\theta(x_i)$, where $P_\theta(x_i)$ is the probability of one data point $x_i$ occurring given a value of $\theta$.
  • $MLE(\theta | X=x) = \arg\max_\theta P(X=x | \theta) = \arg\max_\theta \ln P(X=x | \theta)$
  • The maximum occurs where $\frac{\partial}{\partial \theta} L(\theta) = 0$
  • Calculating derivatives of products is a pain, so we can monotonically transform the likelihood function using $\log$. Since $\arg\max_\theta f(\theta) = \arg\max_\theta \log(f(\theta))$, we can instead find the maximum of the log-likelihood function $\log L(\theta)$.
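For the review example, this works out to a closed form. With $k$ positive reviews out of $n$ total (a standard derivation, sketched here for completeness):

$$\log L(\theta) = k \log \theta + (n-k) \log(1-\theta), \qquad \frac{\partial}{\partial \theta} \log L(\theta) = \frac{k}{\theta} - \frac{n-k}{1-\theta} = 0 \;\Longrightarrow\; \hat\theta_{MLE} = \frac{k}{n}$$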

Using MLE to estimate CPT (conditional probability table) values given data points:

  • $P(Y=y) = MLE(\theta | (X,Y))$ = (# data points with $Y=y$) / (# data points total)
  • $P(X=x | Y=y) = MLE(\theta | (X,Y))$ = (# data points where $X=x, Y=y$) / (# data points where $Y=y$)
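A rough Python sketch of these counts (the dataset and variable names here are made up for illustration):

```python
from collections import Counter

# Hypothetical dataset of (x, y) pairs
data = [(0, 1), (1, 1), (1, 0), (0, 1), (1, 1)]

pair_counts = Counter(data)               # counts of (X=x, Y=y) pairs
y_counts = Counter(y for _, y in data)    # counts of Y=y

# MLE for P(Y=1): (# data points with Y=1) / (# data points total)
p_y1 = y_counts[1] / len(data)

# MLE for P(X=1 | Y=1): (# points with X=1, Y=1) / (# points with Y=1)
p_x1_given_y1 = pair_counts[(1, 1)] / y_counts[1]

print(p_y1, p_x1_given_y1)
```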

## Bayesian Parameter Estimation

Now, $\theta$ is random. We then need to specify a prior $p(\theta)$ that represents what we believe the distribution for $\theta$ might look like.

In a Bayesian approach, we calculate the posterior distribution $p(\theta | x)$, which is an update to the prior now that we have observed some data.

Through Bayes’ Rule, $p(\theta | x) = \frac{p(x | \theta)\, p(\theta)}{p(x)}$.

  • $p(x | \theta)$ is the likelihood.
  • $p(\theta)$ is the prior.
  • $p(x) = \int p(x | \theta)\, p(\theta)\, d\theta$. This integral is often impossible to compute, especially for high-dimensional data.

Rather than needing to compute $p(x)$, we can simply state that $p(\theta | x)$ is proportional to the likelihood times the prior, $p(x | \theta)\, p(\theta)$.

A convenient choice for the prior is the Beta distribution.

$$p(\theta) \propto \theta^{\alpha - 1}(1-\theta)^{\beta - 1}, \quad \text{i.e. } \theta \sim Beta(\alpha, \beta)$$

The $Beta(1,1)$ distribution is equivalent to the Uniform distribution on $[0, 1]$.

Given that the data $x_i$ are sampled from a $Bernoulli(\theta)$ distribution, which has likelihood $p(x | \theta) = \theta^{k} (1-\theta)^{n - k}$ (where $k$ is the number of positive values out of $n$ total values), and a prior of $Beta(\alpha, \beta)$, we can compute the posterior as follows:

$$p(\theta | x) \propto p(x | \theta)\, p(\theta) \propto \left(\theta^{k} (1-\theta)^{n - k}\right) \cdot \left(\theta^{\alpha - 1}(1 - \theta)^{\beta - 1}\right) \propto \theta^{k+\alpha-1} (1-\theta)^{n-k+\beta - 1}$$

which is a $Beta(k + \alpha,\ n - k + \beta)$ distribution.
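A minimal sketch of this update in code (the prior hyperparameters and review data below are made up for illustration):

```python
# Beta-Bernoulli posterior update: prior Beta(alpha, beta), k positives out of n reviews.
alpha, beta = 1, 1                  # Beta(1, 1) prior, i.e. uniform over [0, 1]
reviews = [1, 1, 0, 1, 1, 0, 1]     # 1 = good review, 0 = bad review

n = len(reviews)
k = sum(reviews)

# Posterior is Beta(k + alpha, n - k + beta)
post_alpha = k + alpha
post_beta = n - k + beta
print(f"posterior: Beta({post_alpha}, {post_beta})")
```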

## Maximum A Posteriori (MAP)

Point estimators, one of which is the MAP method, reduce a posterior distribution into a single number.

We can calculate it as the argmax of the posterior with respect to $\theta$, i.e. the value of $\theta$ that gives us the largest value for the posterior.

The MAP estimate is the mode of the posterior distribution.
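For the Beta posterior from the review example, the mode has a closed form (assuming both posterior parameters are greater than 1), which gives:

$$\hat\theta_{MAP} = \arg\max_\theta p(\theta | x) = \frac{k + \alpha - 1}{n + \alpha + \beta - 2}$$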

## Minimum Mean-Squared Error (MMSE)

MMSE is another point estimator; it finds the estimate $\hat\theta$ that gives the smallest mean squared error:

$$\hat\theta_{MMSE} = \arg\min_{\hat\theta} E_{\theta | x}\left[(\hat\theta - \theta)^2\right]$$

The MMSE estimate is the mean of the posterior distribution.
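For the Beta posterior from the review example, the posterior mean works out to:

$$\hat\theta_{MMSE} = E[\theta | x] = \frac{k + \alpha}{n + \alpha + \beta}$$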

## Example: Estimating the Mean with Gaussian (Normal) Likelihood

Suppose we’re given that $n$ people have heights $x_1, \cdots, x_n$ distributed normally. We’re trying to find $\mu$, which is the population mean.

Then, the likelihood $p(x_i | \mu) = N(x_i; \mu, \sigma)$ describes the probability of getting any one height given the mean.

Under a frequentist approach, we can calculate the MLE to be $\hat\mu_{MLE} = \frac{1}{n} \sum_{i=1}^n x_i$.
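Sketching the derivation (treating $\sigma$ as known): the log-likelihood is maximized where its derivative with respect to $\mu$ is zero,

$$\log L(\mu) = -\frac{1}{2\sigma^2}\sum_{i=1}^n (x_i - \mu)^2 + \text{const}, \qquad \frac{\partial}{\partial \mu} \log L(\mu) = \frac{1}{\sigma^2}\sum_{i=1}^n (x_i - \mu) = 0 \;\Longrightarrow\; \hat\mu_{MLE} = \frac{1}{n}\sum_{i=1}^n x_i$$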

Under a Bayesian approach, we’re trying to find the posterior $p(\mu | x) \propto p(x | \mu)\, p(\mu)$.

If the likelihood and prior are both normal, then the posterior is also normal. This (along with the Beta prior and Bernoulli likelihood example above) is an example of a conjugate prior: in general, a prior is conjugate to a likelihood when the resulting posterior is in the same family as the prior.
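Concretely, if the prior is $\mu \sim N(\mu_0, \sigma_0^2)$ and each $x_i \sim N(\mu, \sigma^2)$ with $\sigma^2$ known (the hyperparameter names $\mu_0, \sigma_0^2$ are my own, not from the notes), the standard conjugate result gives a normal posterior:

$$p(\mu | x) = N\!\left(\mu;\; \frac{\frac{\mu_0}{\sigma_0^2} + \frac{1}{\sigma^2}\sum_{i=1}^n x_i}{\frac{1}{\sigma_0^2} + \frac{n}{\sigma^2}},\;\; \left(\frac{1}{\sigma_0^2} + \frac{n}{\sigma^2}\right)^{-1}\right)$$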