Parameter Estimation. Learning From Data: MLE. Parameter Estimation. Likelihood. Maximum Likelihood Parameter Estimation. Likelihood Function 12/1/16

Size: px

Start display at page:

Download "Parameter Estimation. Learning From Data: MLE. Parameter Estimation. Likelihood. Maximum Likelihood Parameter Estimation. Likelihood Function 12/1/16"

Olivia Lawson
5 years ago
Views:

Learning From Data: MLE Maximum Estimators Common approach in statistics: use a parametric model of data: Assume data set: Bin(n, p), Poisson( ),

.., xn is from a parametric distribution f(x ), estimate. E.g.

with parameter P(x ): Probability of event x given model Viewed as a function of x (fixed ), its a probability E.g., Σx P(x θ) = Viewed as a function of (fixed x), its a likelihood E.

e., event HHTHH is more likely when θ =.6 than θ =.5 And what θ make HHTHH most likely?

1 Learning From Data: MLE Maximum Estimators Common approach in statistics: use a parametric model of data: Assume data set: Bin(n, p), Poisson( ), N(µ, exp( ) Uniform(a, b) 2 ) But parameters are unknown!!! Need to estimate them. 2 Assuming sample x, x2,..., xn is from a parametric distribution f(x ), estimate. E.g.: Given sample HHTTTTTHTHTTTHH of (possibly biased) coin flips, estimate = probability of Heads f(x ) is the Bernoulli probability mass function with parameter P(x ): Probability of event x given model Viewed as a function of x (fixed ), its a probability E.g., Σx P(x θ) = Viewed as a function of (fixed x), its a likelihood E.g., Σθ P(x θ) can be anything; relative values of interest. E.g., if θ = prob of heads in a sequence of coin flips then P(HHTHH.6) > P(HHTHH.5), I.e., event HHTHH is more likely when θ =.6 than θ =.5 And what θ make HHTHH most likely? 4 Function P( HHTHH ): Probability of HHTHH, given P(H) = : 4 (-) max Maximum One (of many) approaches to param. est. of (indp) observations x, x 2,..., x n As a function of θ, what θ maximizes the likelihood of the data actually observed Typical approach: or 6

$.., xn; n0 tails, n heads, n0 + n = n; θ = probability of heads (Also verify its max, not min, & not better on boundary) Observed fraction of successes in$ sample is MLE of success probability in population 7 8 Ex2: I got data; a little birdie tells me its normal, and promises 2 = Assuming sample x, x2,.

sample is MLE of success probability in population 7 8 Ex2: I got data; a little birdie tells me its normal, and promises 2 = Assuming sample x, x2,.

2 Example n coin flips, x, x2,..., xn; n0 tails, n heads, n0 + n = n; θ = probability of heads Example n coin flips, x, x2,..., xn; n0 tails, n heads, n0 + n = n; θ = probability of heads (Also verify its max, not min, & not better on boundary) Observed fraction of successes in sample is MLE of success probability in population 7 8 Ex2: I got data; a little birdie tells me its normal, and promises 2 = Assuming sample x, x2,..., xn is from a parametric distribution f(x ), estimate. E.g.: Given n normal samples, estimate mean & variance 9 x 0 Which is more likely: (a) this? Which is more likely: (b) or this? unknown, 2 = unknown, 2 = 2 2

3 unknown, 2 = unknown, 2 = Looks good by eye, but how do I optimize my estimate of? 4 Ex. 2: Ex. 2: And verify its max, not min & not better on boundary And verify its max, not min & not better on boundary population mean 5 population mean 6 Hmm, density probability So why is likelihood function equal to product of densities?? a) for maximizing likelihood, we really only care about relative likelihoods, and density captures that and/or b) if density at x is f(x), for any small δ>0, the probability of a sample within ±δ/2 of x is δf(x), but δ is constant wrt θ, so it just drops out of Ex: I got data; a little birdie tells me its normal (but does not tell me 2 ) 7 8 d/dθ log L( ) = 0. x

4 Which is more likely: (a) this?, 2 both unknown Which is more likely: (b) or this?, 2 both unknown ± ± 9 20, 2 both unknown Which is more likely: (d) or this?, 2 both unknown ± ± Which is more likely: (d) or this?, 2 both unknown Looks good by eye, but how do I optimize my estimates of & 2? Ex : ± 0.5 surface 2 θ θ

estimate parameters from data You choose the form of the model (normal, binomial,.

5 Ex : Ex., (cont.) surface θ θ 2 population mean, again In general, a problem like this results in 2 equations in 2 unknowns. Easy in this case, since 2 drops out of the / = 0 equation25 Sample variance is MLE of population variance 26 Summary MLE is one way to estimate parameters from data You choose the form of the model (normal, binomial,...) Math chooses the value(s) of parameter(s) Has the intuitively appealing property that the parameters maximize the likelihood of the observed data; basically just assumes your sample is representative 27 5

Bayesian Methods. David Rosenberg. April 11, New York University. David Rosenberg (New York University) DS-GA 1003 April 11, / 19

Bayesian Methods. David Rosenberg. April 11, New York University. David Rosenberg (New York University) DS-GA 1003 April 11, / 19 Bayesian Methods David Rosenberg New York University April 11, 2017 David Rosenberg (New York University) DS-GA 1003 April 11, 2017 1 / 19 Classical Statistics Classical Statistics David Rosenberg (New