Parameter Estimation
Suppose we observe data points $x_1$ to $x_n$. Let $\theta$ be some unknown parameter that describes the distribution the data points were sampled from.
As an example, let’s say we are trying to quantify how good a product is based on how many positive and negative reviews it has.
- This means that each data point $x_i$ is either $0$ (bad review) or $1$ (good review)
- Let $\theta = P(x_i = 1)$. This means that reviews are positive with probability $\theta$, and negative with probability $1 - \theta$.
Maximum Likelihood Estimation (MLE) #
http://prob140.org/textbook/content/Chapter_20/01_Maximum_Likelihood.html
In a Frequentist approach, $\theta$ is fixed, so our goal is to find the best estimate $\hat{\theta}$ for $\theta$ using the data given.
Recall that the likelihood is the probability of the data given the parameter, $L(\theta) = P(x_1, \dots, x_n \mid \theta)$.
Goal: Given an iid sample of points $x_1, \dots, x_n$, and a distribution described by a parameter $\theta$, what's the value of $\theta$ that gives the highest probability of this set of points occurring under that distribution? (i.e., we want to maximize the likelihood.)
- Formal definition: find $\hat{\theta} = \underset{\theta}{\operatorname{argmax}} \, L(\theta) = \underset{\theta}{\operatorname{argmax}} \prod_{i=1}^{n} P(x_i \mid \theta)$, where $P(x_i \mid \theta)$ is the probability of one data point occurring given a value of $\theta$.
- The maximum occurs when $\frac{\partial}{\partial \theta} L(\theta) = 0$.
- Calculating derivatives of products is a pain, so we can monotonically transform the likelihood function using $\log$. Since $\log$ is strictly increasing, we can instead find the maximum of the log-likelihood function: $\hat{\theta} = \underset{\theta}{\operatorname{argmax}} \sum_{i=1}^{n} \log P(x_i \mid \theta)$. (A numerical sketch follows this list.)
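As a concrete sketch of this recipe (the review data below is made up for illustration), we can maximize the Bernoulli log-likelihood over a grid of candidate $\theta$ values; the answer should match the closed-form MLE $k/n$:

```python
import numpy as np

# Hypothetical review data: 1 = good review, 0 = bad review.
reviews = np.array([1, 1, 0, 1, 1, 0, 1, 1, 1, 0])

# Candidate values of theta (avoid exactly 0 and 1, where log blows up).
thetas = np.linspace(0.001, 0.999, 999)

# Log-likelihood of the data under each candidate theta:
# sum_i log P(x_i | theta), with P(1 | theta) = theta, P(0 | theta) = 1 - theta.
k, n = reviews.sum(), len(reviews)
log_lik = k * np.log(thetas) + (n - k) * np.log(1 - thetas)

theta_mle = thetas[np.argmax(log_lik)]
print(theta_mle)    # 0.7, matching the closed-form MLE k/n = 7/10
```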
Using MLE to predict conditional probability table (CPT) values given data points (a counting sketch follows this list):
- $P(X = x) = $ (# data points with $X = x$) / (# data points total)
- $P(X = x \mid Y = y) = $ (# data points where $X = x$ and $Y = y$) / (# data points where $Y = y$)
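Here's a minimal sketch of these counting rules, using invented samples of two binary variables $X$ and $Y$:

```python
import numpy as np

# Toy joint samples of two binary variables (columns: X, Y).
data = np.array([[1, 0], [1, 1], [0, 1], [1, 1], [0, 0], [1, 1]])
X, Y = data[:, 0], data[:, 1]

# P(X = 1) = (# data points with X = 1) / (# data points total)
p_x1 = np.mean(X == 1)

# P(X = 1 | Y = 1) = (# points with X = 1 and Y = 1) / (# points with Y = 1)
p_x1_given_y1 = np.sum((X == 1) & (Y == 1)) / np.sum(Y == 1)

print(p_x1, p_x1_given_y1)    # 4/6 and 3/4 for this toy data
```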
Bayesian Parameter Estimation #
Now, $\theta$ is random. We then need to specify a prior $P(\theta)$ that represents what we believe the distribution for $\theta$ might look like.
In a Bayesian approach, we calculate the posterior distribution $P(\theta \mid x_1, \dots, x_n)$, which is an update to the prior now that we have observed some data.
Through Bayes' Rule, $P(\theta \mid x_1, \dots, x_n) = \frac{P(x_1, \dots, x_n \mid \theta) \, P(\theta)}{P(x_1, \dots, x_n)}$.
- $P(x_1, \dots, x_n \mid \theta)$ is the likelihood.
- $P(\theta)$ is the prior.
- $P(x_1, \dots, x_n) = \int P(x_1, \dots, x_n \mid \theta) \, P(\theta) \, d\theta$ is the normalizing constant. This integral is often impossible to compute, especially when $\theta$ is high-dimensional.
Rather than needing to compute $P(x_1, \dots, x_n)$, we can simply state that the posterior is proportional to the likelihood times the prior: $P(\theta \mid x_1, \dots, x_n) \propto P(x_1, \dots, x_n \mid \theta) \, P(\theta)$.
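When $\theta$ is one-dimensional, we can sidestep the intractable integral with a numerical approximation. This sketch (reusing the made-up review data from above) normalizes likelihood $\times$ prior on a grid so the posterior integrates to 1:

```python
import numpy as np

reviews = np.array([1, 1, 0, 1, 1, 0, 1, 1, 1, 0])   # hypothetical data
k, n = reviews.sum(), len(reviews)

thetas = np.linspace(0.001, 0.999, 999)
dtheta = thetas[1] - thetas[0]
prior = np.ones_like(thetas)                   # uniform prior density over [0, 1]
likelihood = thetas**k * (1 - thetas)**(n - k)

# Posterior is proportional to likelihood * prior; normalize with a Riemann-sum
# approximation of P(x_1, ..., x_n) = integral of P(x | theta) P(theta) dtheta.
unnormalized = likelihood * prior
posterior = unnormalized / (unnormalized.sum() * dtheta)

print(posterior.sum() * dtheta)                # ~1.0, a valid density
```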
A convenient choice for the prior is the Beta distribution: Beta($\alpha$, $\beta$) has density proportional to $\theta^{\alpha - 1}(1 - \theta)^{\beta - 1}$ for $\theta \in [0, 1]$.
The Beta(1,1) distribution is equivalent to the Uniform distribution.
Given that the data are sampled from a Bernoulli($\theta$) distribution, which has likelihood function $P(x_1, \dots, x_n \mid \theta) = \theta^{k}(1 - \theta)^{n - k}$ (where $k$ is the number of positive values out of $n$ total values), and a prior of Beta($\alpha$, $\beta$), we can compute the posterior as such: $P(\theta \mid x_1, \dots, x_n) \propto \theta^{k}(1 - \theta)^{n - k} \cdot \theta^{\alpha - 1}(1 - \theta)^{\beta - 1} = \theta^{k + \alpha - 1}(1 - \theta)^{n - k + \beta - 1}$, which is a Beta($k + \alpha$, $n - k + \beta$) distribution.
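Since the update is conjugate, no integration is needed: the posterior parameters come straight from the counts. A sketch using the same invented review data and a uniform Beta(1, 1) prior (scipy's beta distribution is used only to query posterior summaries):

```python
import numpy as np
from scipy.stats import beta

reviews = np.array([1, 1, 0, 1, 1, 0, 1, 1, 1, 0])   # hypothetical data
k, n = reviews.sum(), len(reviews)                    # k = 7, n = 10

alpha_prior, beta_prior = 1, 1                        # Beta(1, 1) = uniform prior
alpha_post = alpha_prior + k                          # add the positive count
beta_post = beta_prior + (n - k)                      # add the negative count

posterior = beta(alpha_post, beta_post)               # Beta(8, 4)
print(posterior.mean())                               # 8/12 ~ 0.667
```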
Maximum A Posteriori (MAP) #
Point estimators, one of which is the MAP method, reduce a posterior distribution into a single number.
We can calculate it as the argmax of the posterior with respect to $\theta$, i.e. the value of $\theta$ that gives us the largest value for the posterior: $\hat{\theta}_{\text{MAP}} = \underset{\theta}{\operatorname{argmax}} \, P(\theta \mid x_1, \dots, x_n)$.
The MAP estimate is the mode of the posterior distribution.
Minimum Mean-Squared Error (MMSE) #
MMSE is another point estimator that finds the value of $\hat{\theta}$ that gives the smallest mean squared error: $\hat{\theta}_{\text{MMSE}} = \underset{\hat{\theta}}{\operatorname{argmin}} \; E\!\left[(\theta - \hat{\theta})^2 \mid x_1, \dots, x_n\right]$.
The MMSE estimate is the mean of the posterior distribution, $E[\theta \mid x_1, \dots, x_n]$.
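Continuing the Beta(8, 4) posterior from the sketch above (all numbers illustrative), both point estimates fall out in closed form: MAP is the mode of the Beta posterior and MMSE is its mean:

```python
# Posterior from the hypothetical review data above: Beta(a, b) with a = 8, b = 4.
a, b = 8, 4

theta_map = (a - 1) / (a + b - 2)    # mode of Beta(a, b), valid for a, b > 1
theta_mmse = a / (a + b)             # mean of Beta(a, b)

print(theta_map)     # 0.7 -- equals the MLE here, since the prior is uniform
print(theta_mmse)    # 0.666...
```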
Example: Estimating the Mean with Gaussian (Normal) Likelihood #
Suppose we're given that people have heights distributed normally, $x_i \sim \mathcal{N}(\mu, \sigma^2)$. We're trying to find $\mu$, which is the population mean.
Then, the likelihood $P(x_i \mid \mu) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(x_i - \mu)^2}{2\sigma^2}\right)$ describes the probability of getting any one height given the mean.
Under a frequentist approach, we can calculate the MLE to be the sample mean, $\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} x_i$.
Under a Bayesian approach, we're trying to find the posterior $P(\mu \mid x_1, \dots, x_n)$.
If the likelihood and prior are both normal, then the posterior is also normal. This (along with the Beta prior and Bernoulli likelihood example above) is an example of a conjugate prior: in general, a prior is conjugate to a likelihood when the resulting posterior belongs to the same family as the prior.
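A sketch of the Normal-Normal update, assuming a known data variance $\sigma^2$ and made-up prior parameters and heights: the posterior precision is the sum of the prior and data precisions, and the posterior mean is a precision-weighted average of the prior mean and the MLE (the sample mean):

```python
import numpy as np

heights = np.array([170.0, 165.0, 180.0, 175.0, 172.0])  # hypothetical heights (cm)
n = len(heights)
sigma2 = 49.0                     # assume the data variance is known: sigma^2 = 7^2

# Frequentist: the MLE for mu is the sample mean.
mu_mle = heights.mean()

# Bayesian: prior mu ~ N(mu0, tau0^2); by conjugacy the posterior is also normal.
mu0, tau0_sq = 160.0, 100.0       # made-up prior belief about the mean height
post_var = 1.0 / (1.0 / tau0_sq + n / sigma2)
post_mean = post_var * (mu0 / tau0_sq + n * mu_mle / sigma2)

print(mu_mle)                     # 172.4
print(post_mean, post_var)        # ~171.3, pulled from the prior toward the data
```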