
Corpora and Statistical Methods – Part 2
Albert Gatt
Preliminaries: Hypothesis testing and the binomial distribution
Permutations
 Suppose we have the 5 words {the, dog, ate, a, bone}
 How many permutations (possible orderings) are there of these words?
 the dog ate a bone
 dog the ate a bone
 …
 E.g. there are 5! = 120 ways of permuting 5 words.
n! 1... n 1 n
Binomial coefficient
 Slight variation:
 How many different choices of three words are there out of these 5?
 This is known as an “n choose k” problem, in our case: “5 choose 3”
$\binom{n}{k} = \frac{n!}{k!\,(n-k)!}$
 For our problem, this gives us 10 ways of choosing three items out of
5
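A quick check in Python (math.comb requires Python 3.8+):

```python
import math

print(math.comb(5, 3))                                                # 10
print(math.factorial(5) // (math.factorial(3) * math.factorial(2)))  # 10, straight from the formula
```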
Bernoulli trials

A Bernoulli (or binomial) trial is like a coin flip. Features:
1. There are two possible outcomes (not necessarily with the same likelihood), e.g. success/failure or 1/0.
2. If the situation is repeated, then the likelihoods of the two outcomes are stable.
Sampling with/out replacement
 Suppose we’re interested in the probability of pulling out a
function word from a corpus of 100 words.
 we pull out words one by one without putting them back
 Is this a Bernoulli trial?
 we have a notion of success/failure: w is either a function word
(“success”) or not (“failure”)
 but our chances aren’t the same across trials: they diminish since we
sample without replacement
Cutting corners
 If the sample (e.g. the corpus) is large enough, then we can assume a Bernoulli situation even if we sample without replacement.
 Suppose our corpus has 52 million words
 Success = pulling out a function word
 Suppose there are 13 million function words
 First trial: p(success) = 13,000,000/52,000,000 = .25
 Second trial: p(success) = 12,999,999/51,999,999 ≈ .25
 On very large samples, the chances remain relatively stable even without replacement.
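A small Python sketch of the same arithmetic, assuming (as above) that every draw so far happened to be a success:

```python
function_words, corpus_size = 13_000_000, 52_000_000

# Probability of drawing a function word on successive trials without
# replacement, assuming every previous draw was a success.
for trial in range(3):
    p = (function_words - trial) / (corpus_size - trial)
    print(f"trial {trial + 1}: p(success) = {p:.8f}")
# All three values are ~0.25: with a very large corpus the probability
# barely moves even though we never put the drawn words back.
```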
Binomial probabilities - I
 Let π represent the probability of success on a Bernoulli trial
(e.g. our simple word game on a large corpus).
 Then, p(failure) = 1 - π
 Problem: What are the chances of achieving success 3 times
out of 5 trials?
 Assumption: each trial is independent of every other.
 (Is this assumption reasonable?)
Binomial probabilities - II
 How many ways are there of getting success three times out of 5?
 Several: SSSFF, SFSFS, SFSSF, …
 To count the number of possible ways of getting k successes out of n trials, we use the binomial coefficient:
 5
5!
120
  

 10
 3  3!(5  3)! 6  2
Binomial probabilities - III
 “5 choose 3” gives 10.
 Given independence, each of these sequences is equally likely.
 What’s the probability of a sequence?
 it’s an AND problem (multiplication rule)
 P(SSSFF) = πππ(1−π)(1−π) = π³(1−π)²
 P(SFSFS) = π(1−π)π(1−π)π = π³(1−π)²
 (they all come out the same)
Binomial probabilities - IV
 The binomial distribution states that:
 given n Bernoulli trials, with probability π of success on each trial, the
probability of getting exactly k successes is:
$b(k; n, \pi) = \binom{n}{k}\,\pi^{k}(1-\pi)^{n-k}$
 b(k; n, π): probability of k successes out of n
 $\binom{n}{k}$: number of different ways of getting k successes
 π: probability of each success
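A minimal Python sketch of this formula, applied to the 3-successes-out-of-5 example with an assumed π = 0.25:

```python
import math

def binom_pmf(k: int, n: int, pi: float) -> float:
    """b(k; n, pi): probability of exactly k successes in n Bernoulli trials."""
    return math.comb(n, k) * pi**k * (1 - pi)**(n - k)

# Probability of exactly 3 successes out of 5 trials with pi = 0.25
print(binom_pmf(3, 5, 0.25))   # 10 * 0.25**3 * 0.75**2 ≈ 0.088
```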
Expected value and variance
 Expected value: $E[X] = n\pi$ (the expected value of X over n trials)
 Variance: $\mathrm{Var}(X) = n\pi(1-\pi)$ (the variance of X over n trials)
 where π is our probability of success
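A quick Python check that the closed-form mean and variance agree with a brute-force sum over the distribution (same assumed n = 5, π = 0.25):

```python
import math

n, pi = 5, 0.25
pmf = [math.comb(n, k) * pi**k * (1 - pi)**(n - k) for k in range(n + 1)]

mean = sum(k * p for k, p in enumerate(pmf))
var = sum((k - mean) ** 2 * p for k, p in enumerate(pmf))

print(n * pi, mean)            # both 1.25
print(n * pi * (1 - pi), var)  # both 0.9375 (up to floating-point error)
```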
Using the t-test for collocation discovery
The logic of hypothesis testing
 The typical scenario in hypothesis testing compares two hypotheses:
1. The research hypothesis
2. A null hypothesis
 The idea is to set up our experiment (study, etc.) in such a way that:
 if we show the null hypothesis to be false, then
 we can affirm our research hypothesis with a certain degree of confidence.
H0 for collocation studies
 There is no real association between w1 and w2, i.e.
occurrence of <w1,w2> is no more likely than chance.
 More formally:
 H0: P(w1 & w2) = P(w1)P(w2)
 i.e. the occurrences of w1 and w2 are independent
Some more on hypothesis testing
 Our research hypothesis (H1):
 <w1,w2> are strong collocates
 P(w1 & w2) > P(w1)P(w2)
 A null hypothesis H0
 P(w1 & w2) = P(w1)P(w2)
 How do we know whether our results are sufficient to affirm
H1?
 i.e. how big is our risk of wrongly rejecting H0?
The notion of significance
 We generally fix a “level of confidence” in advance.
 In many disciplines, we’re happy with being 95% confident
that the result we obtain is correct.
 So we have a 5% chance of error.
 Therefore, we state our results at p = 0.05
 “The probability of wrongly rejecting H0 is 5% (0.05)”
Tests for significance

Many of the tests we use involve:
1. having a prior notion of what the mean/variance of a population is, according to H0
2. computing the mean/variance on our sample of the population
3. checking whether the sample mean/variance differs from the value predicted by H0, at 95% confidence.
The t-test: strategy
 obtain mean (x̄) and variance (s²) for a sample
 H0: sample is drawn from a population with mean μ and variance σ²
 estimate the t value: this compares the sample mean/variance
to the expected (population) mean/variance under H0
 check if any difference found is significant enough to reject
H0
Computing t
 calculate difference between sample mean and expected population mean
 scale the difference by the variance
$t = \frac{\bar{x} - \mu}{\sqrt{\frac{s^2}{N}}}$
 Assumption: population is normally distributed.
 If t is big enough, we reject H0. The magnitude of t given our sample size N is
simply looked up in a table.
 Tables tell us what the level of significance is (p-value, or likelihood of making a
Type 1 error, wrongly rejecting H0).
Example: new companies
 We think of our corpus as a series of bigrams, and each
sample we take is an indicator variable (Bernoulli trial):
 value = 1 if a bigram is new companies
 value = 0 otherwise
 Compute P(new) and P(companies) using standard MLE.
 H0: P(new companies) = P(new)P(companies)
Example continued
 We have computed the likelihood of our bigram of interest
under H0.
 Since this is a Bernoulli Trial, this is also our expected mean.
 We then compute the actual sample probability of <w1,w2>
(new companies).
 Compute t and check significance
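A Python sketch of this procedure. The counts below are invented for illustration, not the actual NY Times figures:

```python
import math

# Invented counts for illustration (not the real corpus figures)
N = 14_000_000        # total bigram tokens in the corpus
c_new = 15_000        # C(new)
c_companies = 4_500   # C(companies)
c_bigram = 8          # C(new companies)

mu = (c_new / N) * (c_companies / N)   # expected bigram probability under H0 (independence)
x_bar = c_bigram / N                   # observed bigram probability (sample mean)
s2 = x_bar * (1 - x_bar)               # Bernoulli variance, roughly x_bar for rare events

t = (x_bar - mu) / math.sqrt(s2 / N)
print(t)   # compare against 1.645, the one-tailed critical value at p = 0.05
```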
Uses of the t-test
 Often used to rank candidate collocations, rather than
compute significance.
 Stop word lists must be used, else all bigrams will be
significant.
 e.g. M&S report 824 out of 831 bigrams that pass the significance
test.
 Reason:
 language is just not random
 regularities mean that if the corpus is large enough, all bigrams will
occur together regularly and often enough to be significant.
 Kilgarriff (2005): Any null hypothesis will be rejected on a large
enough corpus.
Extending the t-test to compare samples
 Variation on the original problem:
 what co-occurrence relations best distinguish between two words, w1 and w1', that are near-synonyms?
 e.g. strong vs. powerful
 Strategy:
 find all bigrams <w1,w2> and <w1',w2>
 e.g. strong tea, powerful tea
 check, for each w2, whether it occurs significantly more often with w1 than with w1'.
 NB. This is a two-sample t-test
Two-sample t-test: details
 H0: for any w2, the probability of <w1,w2> is the same as the probability of <w1',w2>
 i.e. μ (the expected difference) = 0
 Strategy:
 extract sample of <w1,w2> and <w1,w2’>
 assume they are independent
 compute mean and SD for each sample
 compute t
 check for significance: is the magnitude of the difference large enough?
 Formula:
$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$
Simplifying under binomial assumptions
 Because bigram probabilities are very small, the variance in the binomial distribution approaches the mean. I.e.:
$\bar{x}_1 = P(w_1, w_2) \approx s_1^2$
 (similarly for the other sample mean)
 Therefore:
$t \approx \frac{P(w_1,w_2) - P(w_1',w_2)}{\sqrt{\frac{P(w_1,w_2) + P(w_1',w_2)}{N}}}$
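A Python sketch of the simplified two-sample t, with invented counts for strong tea vs. powerful tea; note that the formula reduces to (c1 − c2)/√(c1 + c2):

```python
import math

# Invented counts for illustration: how often "tea" follows each near-synonym
N = 14_000_000           # total bigram tokens
c_strong_tea = 50        # C(strong tea)
c_powerful_tea = 8       # C(powerful tea)

p1, p2 = c_strong_tea / N, c_powerful_tea / N

# Two-sample t with the binomial approximation variance ~= mean
t = (p1 - p2) / math.sqrt((p1 + p2) / N)

# The same value falls out of the raw counts: (c1 - c2) / sqrt(c1 + c2)
print(t, (c_strong_tea - c_powerful_tea) / math.sqrt(c_strong_tea + c_powerful_tea))
```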
Concrete example: strong vs. powerful (M&S, p. 167); NY Times
[Table omitted: words occurring significantly more often with powerful than with strong, and words occurring significantly more often with strong than with powerful]
Criticisms of the t-test
 Assumes that the probabilities are normally distributed. This
is probably not the case in linguistic data, where probabilities
tend to be very large or very small.
 Alternative: chi-squared test (χ²)
 compare differences between expected and observed frequencies (e.g.
of bigrams)
The chi-square test
Example
 Imagine we’re interested in whether poor performance is a good
collocation.
 H0: frequency of poor performance is no different from the expected
frequency if each word occurs independently.
 Find frequencies of bigrams containing poor, performance and
poor performance.
 compare actual to expected frequencies
 check if the value is high enough to reject H0
Example continued
OBSERVED FREQUENCIES

                      w1 = poor             w1 ≠ poor
w2 = performance      15                    1,230
                      (poor performance)    (bad performance)
w2 ≠ performance      3,580                 12,000
                      (poor people)         (all other bigrams)

Expected frequencies need to be computed for each cell.
E.g. expected value for cell (1,1), poor performance:
$E_{1,1} = P(w_1 = poor) \times P(w_2 = performance) \times N$
Computing the value
 The chi-squared value is the sum of differences of observed and
expected frequencies, scaled by expected frequencies.
$X^2 = \sum_{i,j} \frac{(O_{i,j} - E_{i,j})^2}{E_{i,j}}$
 Value is once again looked up in a table to check if degree of
confidence (p-value) is acceptable.
 If so, we conclude that the dependency between w1 and w2 is significant.
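A Python sketch that computes X² from the observed table above (expected counts are derived from the row and column totals):

```python
# Observed frequencies from the 2x2 table above
O = [[15, 1_230],       # w2 = performance:  [w1 = poor, w1 != poor]
     [3_580, 12_000]]   # w2 != performance: [w1 = poor, w1 != poor]

N = sum(sum(row) for row in O)              # total number of bigrams
row_totals = [sum(row) for row in O]
col_totals = [sum(col) for col in zip(*O)]

chi2 = 0.0
for i in range(2):
    for j in range(2):
        expected = row_totals[i] * col_totals[j] / N   # expected count under independence
        chi2 += (O[i][j] - expected) ** 2 / expected

print(chi2)   # compare against 3.84, the critical value at p = 0.05 with 1 degree of freedom
```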
More applications of this statistic
 Kilgarriff and Rose 1998 use chi-square as a measure of corpus
similarity
 draw up an n (rows) × 2 (columns) table
 columns correspond to corpora
 rows correspond to individual types
 compare the difference in counts between corpora
 H0: corpora are drawn from the same underlying linguistic
population (e.g. register or variety)
 corpora will be highly similar if the ratio of counts for each
word is roughly constant.
 This uses lexical variation to compute corpus-similarity.
Limitations of t-test and chi-square
 Not easily interpretable
 a large chi-square or t value suggests a large difference
 but makes more sense as a comparative measure, rather than in
absolute terms
 t-test is problematic because of the normality assumption
 chi-square doesn’t work very well for small frequencies (by
convention, we don’t calculate it if the expected value for any
of the cells is less than 5)
 but n-grams will often be infrequent!
Likelihood ratios for collocation discovery
Rationale
 A likelihood ratio is the ratio of two probabilities
 indicates how much more likely one hypothesis is compared to
another
 Notation:
 c1 = C(w1)
 c2 = C(w2)
 c12 = C(<w1,w2>)
 Hypotheses:
 H0: P(w2|w1) = p = P(w2|¬w1)
 H1:
 P(w2|w1) = p1
 P(w2|¬w1) = p2
 p1 ≠ p2
Computing the likelihood ratio
Under H0:
 P(w2|w1) = P(w2|¬w1) = p = C(w2)/N
 Prob. that c12 of the c1 bigrams starting with w1 are <w1,w2>: b(c12; c1, p)
 Prob. that c2 − c12 of the N − c1 remaining bigrams are <¬w1,w2>: b(c2 − c12; N − c1, p)
Under H1:
 P(w2|w1) = p1 = C(w1,w2)/C(w1)
 P(w2|¬w1) = p2 = (C(w2) − C(w1,w2))/(N − C(w1))
 Prob. that c12 of the c1 bigrams starting with w1 are <w1,w2>: b(c12; c1, p1)
 Prob. that c2 − c12 of the N − c1 remaining bigrams are <¬w1,w2>: b(c2 − c12; N − c1, p2)
Computing the likelihood ratio
 The likelihood (odds) that a hypothesis H is correct is L(H).
$L(H_0) = b(c_{12}; c_1, p)\, b(c_2 - c_{12}; N - c_1, p)$
$L(H_1) = b(c_{12}; c_1, p_1)\, b(c_2 - c_{12}; N - c_1, p_2)$
$\lambda = \frac{L(H_0)}{L(H_1)}$
Computing the Likelihood ratio
 We usually compute the log of the ratio:
$\log \lambda = \log \frac{L(H_0)}{L(H_1)} = \log L(H_0) - \log L(H_1)$
 Usually expressed as −2 log λ, because for very large samples this is roughly equivalent to a χ² value.
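A Python sketch of −2 log λ, computed in log space so the tiny binomial probabilities do not underflow; the counts are invented for illustration:

```python
import math

def log_binom_pmf(k: int, n: int, p: float) -> float:
    """log b(k; n, p), using lgamma so very small probabilities do not underflow."""
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    log_success = k * math.log(p) if k > 0 else 0.0              # avoid log(0) when k == 0
    log_failure = (n - k) * math.log(1 - p) if n - k > 0 else 0.0
    return log_choose + log_success + log_failure

# Invented counts for illustration
N = 14_000_000    # total bigram tokens
c1 = 2_000        # C(w1)
c2 = 1_500        # C(w2)
c12 = 20          # C(<w1,w2>)

p = c2 / N                    # H0: P(w2|w1) = P(w2|not w1)
p1 = c12 / c1                 # H1: P(w2|w1)
p2 = (c2 - c12) / (N - c1)    # H1: P(w2|not w1)

log_lambda = (log_binom_pmf(c12, c1, p) + log_binom_pmf(c2 - c12, N - c1, p)
              - log_binom_pmf(c12, c1, p1) - log_binom_pmf(c2 - c12, N - c1, p2))

print(-2 * log_lambda)   # roughly chi-squared distributed for large samples
```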
Interpreting the ratio
 Suppose that the likelihood ratio for some bigram <w1,w2>
is x. This says:
 If we make the hypothesis that w2 is somehow dependent on w1,
then we expect it to occur x times more than its actual base rate of
occurrence would predict.
 This ratio is also better for sparse data.
 we can use the estimate as an approximate chi-square value even
when expected frequencies are small.
Concrete example: bigrams involving powerful (M&S, p. 174)
Source: NY Times corpus (N = 14.3m)
[Table of bigrams omitted]
Note: sparse data can still have a high log likelihood value! Interpreting −2 log λ as chi-squared allows us to reject H0, even for small samples (e.g. powerful cudgels).
Relative frequency ratios
 An extension of the same logic of a likelihood ratio
 used to compare collocations across corpora
 Let <w1,w2> be our bigram of interest.
 Let C1 and C2 be two corpora:
 p1 = P(<w1,w2>) in C1
 p2 = P(<w1,w2>) in C2.
 r= p1/p2 gives an indication of the relative likelihood of
<w1,w2> in C1 and C2.
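A minimal Python sketch; the counts and corpus sizes are invented for illustration:

```python
# Invented counts and corpus sizes for illustration
count_c1, size_c1 = 2, 10_000_000    # bigram count and total bigrams in C1
count_c2, size_c2 = 44, 12_000_000   # bigram count and total bigrams in C2

p1 = count_c1 / size_c1
p2 = count_c2 / size_c2

r = p1 / p2
print(r)   # r << 1: the bigram is far more characteristic of C2 than of C1
```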
Example application
 Manning and Schütze (p. 176) compare:
 C1: NY Times texts from 1990
 C2: NY Times texts from 1989
 Bigram <East,Berliners> occurs 44 times in C2, but only 2
times in C1, so r = 0.03
 The big difference is due to 1989 papers dealing more with
the fall of the Berlin Wall.
Summary
 We’ve now considered two forms of hypothesis testing:
 t-test
 chi-square
 Also, log-likelihood ratios as measures of relative probability
under different hypotheses.
 Next, we begin to look at the problem of lexical acquisition.
References
 Lapata, M., McDonald, S. and Keller, F. (1999). Determinants of Adjective-Noun plausibility. Proceedings of the 9th Conference of the European Chapter of the Association for Computational Linguistics (EACL-99).
 Kilgarriff, A. (2005). Language is never, ever, ever random. Corpus Linguistics and Linguistic Theory 1(2): 263.
 Church, K. and Hanks, P. (1990). Word association norms, mutual information and lexicography. Computational Linguistics 16(1).