Statistical Methods for NLP LT 2202


1 LT 2202 Lecture 5 Statistical inference January 31, 2012

2 Summary of lecture 4 Probabilities and statistics in Python Scipy Matplotlib Descriptive statistics Random sample Sample mean Sample variance and standard deviation Plotting sample distribution: histograms Correlation

3 pmf and cdf in Python
Import the discrete uniform distribution (randint):
    from scipy.stats import randint
What is the probability of a die rolling 4?
    randint.pmf(4, 1, 7)
What is the probability of rolling 4 or below?
    randint.cdf(4, 1, 7)
What is the probability of rolling between 2 and 4?
    randint.cdf(4, 1, 7) - randint.cdf(1, 1, 7)

4 One additional note: quantiles (ppf)
Document classifier with error rate of 0.20 applied to 100 documents.
Import the binomial distribution (binom):
    from scipy.stats import binom
13 errors or fewer?
    binom.cdf(13, 100, 0.20) gives about 0.047
14 errors or fewer?
    binom.cdf(14, 100, 0.20) gives about 0.080
Smallest k so that P(X ≤ k) is at least 0.05?
    binom.ppf(0.05, 100, 0.20) gives 14
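The relation between cdf and ppf on this slide can be checked directly. The following is a minimal sketch, assuming SciPy is installed; the variable names `k_scan` and `k_ppf` are illustrative and not part of the lecture.

```python
from scipy.stats import binom

n, p, q = 100, 0.20, 0.05  # 100 documents, error rate 0.20, quantile 0.05

# Smallest k with P(X <= k) >= q, found by scanning the cdf directly.
k_scan = next(k for k in range(n + 1) if binom.cdf(k, n, p) >= q)

# The same quantile via the percent point function (the inverse of the cdf).
k_ppf = int(binom.ppf(q, n, p))

print(k_scan, k_ppf)
```

Both approaches should agree, since for a discrete distribution ppf returns the smallest value whose cdf reaches the requested probability.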

5 The pmf

6 Example: human height
The average height of a Swedish female (16-84 years) is 165.5 cm. The standard deviation (σ) is 7 cm. Assume a normal distribution.
Import the normal distribution (norm):
    from scipy.stats import norm
Probability of being at most 154 cm tall?
    norm.cdf(154, 165.5, 7) gives about 0.05
How short are the shortest 5% of the population?
    norm.ppf(0.05, 165.5, 7) gives about 154

7 Statistical inference: overview
Given a random sample, how do we
    estimate some parameter of the distribution? What is the error rate of my tagger? What is the probability of the word green?
    determine some interval that is very likely to contain the true value? 95% confidence interval for the error rate
    test some hypothesis about the parameter? Is the error rate greater than 0.03? Is the error rate of tagger A greater than that of tagger B?

8 Random sample A random sample is a set of values generated by some random variable Typically generated by carrying out some repeated experiment Examples: Running a tagger on a text and counting errors Word and sentence lengths in a corpus

9 Random sample: formally Let's assume we have a random variable X. Definition: a sample variable for X is a set of independent variables $X_1, \ldots, X_n$ with the same distribution as X. Definition: a random sample $x_1, \ldots, x_n$ of X is a possible outcome of the sample variable $X_1, \ldots, X_n$.

10 Random sample: example Let's assume we have a random variable: X = the roll of a die. Sample variable: $X_1, \ldots, X_3$ = the rolls of 3 dice. A possible random sample: 5, 1, 4.
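A random sample of this kind can be drawn in Python. This is a small sketch, assuming SciPy's `randint` as used in the earlier slides; the seed value 0 is an arbitrary choice made only so that repeated runs give the same rolls.

```python
from scipy.stats import randint

# Three independent rolls of a fair die: values 1..6 (the upper bound 7 is exclusive).
sample = randint.rvs(1, 7, size=3, random_state=0)

print(sample)
```

Each run with a different seed gives another possible outcome of the sample variable.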

11 Point estimates Given a dataset, how do we estimate some parameter of the random variable that generated the data? An estimator is a function that guesses a parameter value given a dataset

12 Maximum likelihood estimates There are many ways to construct estimators Most common: the maximum likelihood method: Select the parameter value that maximizes the probability of the data

13 Maximum likelihood estimates Select the parameter value that maximizes the probability of the data. Mathematically, define the likelihood function L(p) like this: $L(p) = P(x_1, \ldots, x_n \mid p) = P(x_1 \mid p) \cdots P(x_n \mid p)$ Then find the p* that maximizes L(p). We'll now look at one special case.

14 Maximum likelihood estimation of the probability of an event We carry out an experiment n times, and we get a positive outcome x times How do we estimate the probability p of a positive outcome?

15 Estimating the parameter This is a binomial distribution with parameters n and p. Maximum likelihood estimation: find the p* that makes x most likely. Maximize $L(p) = P(x \mid p) = \binom{n}{x} p^x (1 - p)^{n - x}$

16 Estimating the parameter Maximize $L(p) = P(x \mid p) = \binom{n}{x} p^x (1 - p)^{n - x}$ It can be shown that the ML estimate is $p^* = \frac{x}{n}$
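One way to see this result is the standard calculus argument below. It is a sketch that is not taken from the lecture itself, and it assumes 0 < x < n so that the logarithms are defined: take the logarithm of the binomial likelihood, differentiate with respect to p, and set the derivative to zero.

```latex
\begin{align*}
\log L(p) &= \log \binom{n}{x} + x \log p + (n - x) \log (1 - p) \\
\frac{d}{dp} \log L(p) &= \frac{x}{p} - \frac{n - x}{1 - p} = 0 \\
x (1 - p) &= (n - x)\, p \\
p^* &= \frac{x}{n}
\end{align*}
```

Because the logarithm is increasing, the p that maximizes log L(p) also maximizes L(p).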

17 Intuition: move the bubble For instance: 100 documents, 8 errors Move the bubble to maximize the probability of 8 errors
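The "move the bubble" idea can be imitated numerically by trying many candidate values of p and keeping the one that gives the highest probability of 8 errors in 100 documents. This is a rough sketch, assuming NumPy and SciPy; the grid and its resolution are arbitrary choices, not part of the lecture.

```python
import numpy as np
from scipy.stats import binom

n, x = 100, 8  # 100 documents, 8 errors (the example on the slide)

# Candidate values for p on a grid with step 0.001 (an arbitrary resolution).
grid = np.linspace(0.001, 0.999, 999)

# Probability of observing exactly x errors under each candidate p.
likelihood = binom.pmf(x, n, grid)

# The candidate with the highest likelihood.
p_star = grid[np.argmax(likelihood)]

print(p_star, x / n)
```

The grid maximum should coincide with the closed-form estimate x/n up to the grid resolution.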

18 ML estimation of word probabilities We observe the words in a corpus of 1,173,766 words: the: 50,975 times; big: 559 times; dog: 10 times. Assuming a unigram model: what is the probability of the? ML estimation: p_the = 50,975 / 1,173,766 ≈ 0.0434

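19 The probabilities of rare events Sentence probability in the unigram model: P(the big dog) = P(the) P(big) P(dog) = (50,975 / 1,173,766) * (559 / 1,173,766) * (10 / 1,173,766). P(the big donut) = P(the) P(big) P(donut) = (50,975 / 1,173,766) * (559 / 1,173,766) * 0 = 0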

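20 Laplace's law: add one to all counts
Vocabulary size: 49,206 words
P(the) = (50,975 + 1) / (1,173,766 + 49,206)
P(big) = (559 + 1) / (1,173,766 + 49,206)
P(dog) = (10 + 1) / (1,173,766 + 49,206)
P(donut) = (0 + 1) / (1,173,766 + 49,206)
P(the big donut) = P(the) P(big) P(donut) = (50,976 / 1,222,972) * (560 / 1,222,972) * (1 / 1,222,972), which is small but no longer 0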
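The effect of add-one smoothing on the unseen word can be sketched in a few lines. The counts come from the slides; the function names `p_mle`, `p_laplace` and `sentence_prob` are hypothetical helpers introduced only for this illustration.

```python
# Counts from the slides; "donut" does not occur in the corpus.
counts = {"the": 50975, "big": 559, "dog": 10, "donut": 0}
corpus_size = 1173766
vocab_size = 49206

def p_mle(word):
    # Maximum likelihood estimate: relative frequency.
    return counts[word] / corpus_size

def p_laplace(word):
    # Laplace's law: add one to every count, so the total grows by the vocabulary size.
    return (counts[word] + 1) / (corpus_size + vocab_size)

def sentence_prob(words, p):
    # Unigram model: product of the individual word probabilities.
    prob = 1.0
    for w in words:
        prob *= p(w)
    return prob

print(sentence_prob(["the", "big", "donut"], p_mle))
print(sentence_prob(["the", "big", "donut"], p_laplace))
```

With the ML estimates the whole sentence gets probability 0 because one factor is 0; with the smoothed estimates every factor is positive.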

21 Evaluating performance When evaluating NLP systems, several performance measures can be interpreted as probabilities. error rate = 1 - accuracy Precision / recall False positive rate / true positive rate We estimate all these using ML

22 Performance measures as probabilities Error rate = P(error), accuracy = P(correct). Precision = P(positive | guess positive). Recall = TPR = P(guess positive | positive). FPR = P(guess positive | negative).

23 ML estimates of performance measures
Error rate = P(error). MLE: #errors / #tests
Precision = P(positive | guess positive). MLE: #positive and guess positive / #guess positive
Recall = TPR = P(guess positive | positive). MLE: #positive and guess positive / #positive
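These count ratios are straightforward to compute from a list of gold labels and a list of system guesses. The sketch below uses a made-up toy dataset and a hypothetical helper named `performance`; it does not guard against division by zero when there are no positives or no positive guesses.

```python
def performance(gold, guess):
    """gold, guess: equal-length lists of booleans (True = positive)."""
    n_tests = len(gold)
    n_errors = sum(1 for g, h in zip(gold, guess) if g != h)
    n_pos = sum(1 for g in gold if g)
    n_guess_pos = sum(1 for h in guess if h)
    n_both = sum(1 for g, h in zip(gold, guess) if g and h)
    return {
        "error_rate": n_errors / n_tests,    # #errors / #tests
        "precision": n_both / n_guess_pos,   # #positive and guess positive / #guess positive
        "recall": n_both / n_pos,            # #positive and guess positive / #positive
    }

# Toy data, invented for illustration only.
gold = [True, True, True, False, False, False, False, True]
guess = [True, False, True, True, False, False, False, True]

result = performance(gold, guess)
print(result)
```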

24 Interval estimates If our estimator gives us a value of a parameter: how close is it to the true value? Definition: a confidence interval for the parameter θ with significance value α is an interval $[\theta_1, \theta_2]$ such that $P(\theta_1 \le \theta \le \theta_2) \ge \alpha$ (in these slides α is the coverage probability, for example 0.95). Example: error rate between 0.05 and 0.08 with 95% probability

25 The distribution of our estimator Our estimator applied to randomly selected samples has a distribution Depends on the sample size

26 Estimator distribution / sample size
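The dependence of the estimator's distribution on the sample size can be illustrated by simulation. This is a sketch under stated assumptions: the true error rate 0.08, the number of repetitions and the helper name `estimator_spread` are all invented for the illustration.

```python
import numpy as np

p_true = 0.08       # hypothetical true error rate
n_repeats = 10000   # number of simulated test sets per sample size

def estimator_spread(n, rng):
    # Simulate the error count on n_repeats test sets of n documents each,
    # turn each count into an ML estimate, and measure how much the estimates vary.
    errors = rng.binomial(n, p_true, size=n_repeats)
    estimates = errors / n
    return estimates.std()

rng = np.random.default_rng(0)
spreads = {n: estimator_spread(n, rng) for n in (100, 1000, 10000)}

for n, sd in spreads.items():
    # Compare the simulated spread with the theoretical value sqrt(p(1-p)/n).
    print(n, sd, np.sqrt(p_true * (1 - p_true) / n))
```

The estimates cluster more tightly around the true value as the test set grows, which is why larger test sets give narrower confidence intervals.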

27 Computing a confidence interval If we have made a point estimate θ*, how can we compute an interval that contains the true θ with 95% probability? Computing this exactly is impractical for most distributions. We'll give an approximate method for the case of error/success rates.

28 Cookbook method for error rate confidence interval Pretend that the binomial is normal. The true variance is p(1-p)/n; use p* instead. Then we can use the following approximate confidence interval: $I_p = p^* \pm z_\alpha \sigma^*$, where $\sigma^* = \sqrt{\frac{p^* (1 - p^*)}{n}}$

29 Explanation of formula $I_p = p^* \pm z_\alpha \sigma^*$, where $\sigma^* = \sqrt{\frac{p^* (1 - p^*)}{n}}$. $z_\alpha$ is the value such that $P(-z_\alpha < X < z_\alpha) = \alpha$ if X has a standard normal distribution.

30 Normal quantile in Python
$z_\alpha$ is the value such that $P(-z_\alpha < X < z_\alpha) = \alpha$ if X has a standard normal distribution.
In Python:
    z_x: norm.ppf(1-(1-x)/2)
    z_0.95: norm.ppf(0.975)
    z_0.99: norm.ppf(0.995)
    z_0.999: norm.ppf(0.9995)

31 Example in Python
Assume we test on n = 10,000 documents and make n_err = 745 errors.
    import math
    from scipy.stats import norm
    n = 10000.0; n_err = 745.0
    p_mle = n_err/n
    sd_est = math.sqrt(p_mle*(1-p_mle)/n)
    z95 = norm.ppf(0.975)
    p_upper = p_mle + z95*sd_est
    p_lower = p_mle - z95*sd_est

32 Comparing performance measurements If we evaluate two NLP tools and get the estimated error rates $p_1^*$ and $p_2^*$, how can we say that a difference is not due to chance? We distinguish two cases: estimated on different test sets; estimated on the same test set.

33 Performance estimated on different test sets Define d* as the difference between the estimated error rates $p_1^*$ and $p_2^*$. Now we give a cookbook method for computing a confidence interval $I_d$ for d. If $I_d$ does not include 0, we can say that the difference is real.

34 Confidence interval for the difference $\sigma^* = \sqrt{\frac{p_1^* (1 - p_1^*)}{n_1} + \frac{p_2^* (1 - p_2^*)}{n_2}}$, $I_d = d^* \pm z_\alpha \sigma^*$

35 Example in Python
Assume we test tagger 1 on n_1 = 2,000 documents and make x_1 = 67 errors, and tagger 2 on n_2 = 1,500 documents and make x_2 = 68 errors.
    import math
    from scipy.stats import norm
    n1 = 2000.0; n2 = 1500.0; x1 = 67.0; x2 = 68.0
    p1_mle = x1/n1; p2_mle = x2/n2; d_mle = p1_mle-p2_mle
    sd_est = math.sqrt(p1_mle*(1-p1_mle)/n1 + p2_mle*(1-p2_mle)/n2)
    z95 = norm.ppf(0.975)
    d_upper = d_mle + z95*sd_est
    d_lower = d_mle - z95*sd_est

36 Performance estimated on the same test set There are many such tests. We'll present one of the simplest: McNemar's test.

37 McNemar's test
Make a 2x2 contingency table:

                    System 1 OK    System 1 error
    System 2 OK          a               c
    System 2 error       b               d

We are interested in the differences: b and c
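The four cells can be counted from per-item results when both systems are run on the same test set. This is a minimal sketch with invented toy data; `mcnemar_table` is a hypothetical helper name, and the cell layout follows the table on this slide (b = system 1 OK and system 2 error, c = system 1 error and system 2 OK).

```python
def mcnemar_table(ok1, ok2):
    """ok1, ok2: per-item booleans, True where the system was correct."""
    a = b = c = d = 0
    for s1, s2 in zip(ok1, ok2):
        if s1 and s2:
            a += 1      # both systems OK
        elif s1 and not s2:
            b += 1      # system 1 OK, system 2 error
        elif not s1 and s2:
            c += 1      # system 1 error, system 2 OK
        else:
            d += 1      # both systems wrong
    return a, b, c, d

# Toy results on six test items, invented for illustration only.
ok1 = [True, True, False, False, True, False]
ok2 = [True, False, True, False, False, False]

a, b, c, d = mcnemar_table(ok1, ok2)
print(a, b, c, d)
```

Only the items where the two systems disagree end up in b and c; the items where they agree carry no information about which system is better.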

38 McNemar's test
Form the test quantity h: $h = \frac{(b - c)^2}{b + c}$
If h > threshold, we have a significant difference: threshold = $\chi^2_\alpha(1)$
In Python (α = significance level):
    from scipy.stats import chi2
    threshold = chi2.ppf(alpha, 1)

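39 Example
[2x2 contingency table for System 1 and System 2; the cell values are not preserved here. The computation uses b = 46 and c = 29.]
Err rate 1 = 0.10, err rate 2 = 0.13. Significant difference?
We form the test quantity: $h = \frac{(b - c)^2}{b + c} = \frac{(46 - 29)^2}{46 + 29} \approx 3.85$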

40 McNemar's test in Python
    from scipy.stats import chi2
    b = 46.0; c = 29.0
    alpha = 0.95
    threshold = chi2.ppf(alpha, 1)
    h = (b-c)*(b-c)/(b+c)
    if h > threshold:
        print('Significant at level', alpha)
Threshold = 3.84, h = 3.85!

41 Summary Point estimates: Given a dataset, how do I estimate my parameters? Interval estimates: Given a dataset, how do I compute an interval likely to contain true value? Comparing performance estimates
