Corpora and Statistical Methods – Part 2
Albert Gatt
Preliminaries: Hypothesis testing and the binomial distribution
Permutations
Suppose we have the 5 words {the, dog, ate, a, bone}
How many permutations (possible orderings) are there of these words?
the dog ate a bone
dog the ate a bone
…
There are 5! = 120 ways of permuting 5 words.
In general: $n! = 1 \times 2 \times \cdots \times (n-1) \times n$
Binomial coefficient
Slight variation:
How many different choices of three words are there out of these 5?
This is known as an “n choose k” problem, in our case: “5 choose 3”
$\binom{n}{k} = \frac{n!}{k!(n-k)!}$
For our problem, this gives us 10 ways of choosing three items out of 5.
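As a quick check, here is a minimal Python sketch (standard library only) reproducing both counts above:

```python
from math import comb, factorial

words = ["the", "dog", "ate", "a", "bone"]

# Permutations: n! orderings of n words -> 5! = 120
print(factorial(len(words)))  # 120

# "n choose k": ways of choosing 3 words out of 5 -> 10
print(comb(len(words), 3))    # 10
```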
Bernoulli trials
A Bernoulli (or binomial) trial is like a coin flip. Features:
1. There are two possible outcomes (not necessarily with the same likelihood), e.g. success/failure or 1/0.
2. If the situation is repeated, then the likelihoods of the two outcomes are stable.
Sampling with/out replacement
Suppose we’re interested in the probability of pulling out a
function word from a corpus of 100 words.
we pull out words one by one without putting them back
Is this a Bernoulli trial?
we have a notion of success/failure: w is either a function word
(“success”) or not (“failure”)
but our chances aren’t the same across trials: they diminish since we
sample without replacement
Cutting corners
If the sample (e.g. the corpus) is large enough, then we can
assume a Bernoulli situation even if we sample without
replacement.
Suppose our corpus has 52 million words
Success = pulling out a function word
Suppose there are 13 million function words
First trial: p(success) = .25
Second trial: p(success) = 12,999,999/51,999,999 ≈ .24999999
On very large samples, the chances remain relatively stable even
without replacement.
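A small sketch of the arithmetic above, showing how slowly p(success) drifts when sampling without replacement from a corpus of this size (we assume the worst case, where every previous draw was a success):

```python
# 52M-word corpus containing 13M function words.
function_words, total = 13_000_000, 52_000_000

for trial in range(3):
    # Assume all previous draws were successes (fastest possible drift).
    p = (function_words - trial) / (total - trial)
    print(f"trial {trial + 1}: p(success) = {p:.8f}")
# All three values are ~0.25; they differ only around the 8th decimal place.
```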
Binomial probabilities - I
Let π represent the probability of success on a Bernoulli trial
(e.g. our simple word game on a large corpus).
Then, p(failure) = 1 - π
Problem: What are the chances of achieving success 3 times
out of 5 trials?
Assumption: each trial is independent of every other.
(Is this assumption reasonable?)
Binomial probabilities - II
How many ways are there of getting success three times out of 5?
Several: SSSFF, SFSFS, SFSSF, …
To estimate the number of possible ways of getting k outcomes
from n possibilities, we use the binomial coefficient:
$\binom{5}{3} = \frac{5!}{3!(5-3)!} = \frac{120}{6 \times 2} = 10$
Binomial probabilities - III
“5 choose 3” gives 10.
Given independence, each of these sequences is equally likely.
What’s the probability of a sequence?
it’s an AND problem (multiplication rule)
P(SSSFF) = π·π·π·(1−π)·(1−π) = π³(1−π)²
P(SFSFS) = π·(1−π)·π·(1−π)·π = π³(1−π)²
(they all come out the same)
Binomial probabilities - IV
The binomial distribution states that:
given n Bernoulli trials, with probability π of success on each trial, the
probability of getting exactly k successes is:
$b(k; n, \pi) = \binom{n}{k} \pi^k (1-\pi)^{n-k}$
where π is the probability of each success, b(k; n, π) is the probability of k successes out of n, and the binomial coefficient gives the number of different ways of getting k successes.
Expected value and variance
Expected value of X over n trials: $E[X] = n\pi$
Variance of X over n trials: $Var(X) = n\pi(1-\pi)$
where π is our probability of success.
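A minimal sketch of these formulas in Python (binom_pmf is a hypothetical helper written for this example, not a library function):

```python
from math import comb

def binom_pmf(k: int, n: int, pi: float) -> float:
    """b(k; n, pi): probability of exactly k successes in n Bernoulli trials."""
    return comb(n, k) * pi ** k * (1 - pi) ** (n - k)

n, pi = 5, 0.25
print(binom_pmf(3, n, pi))  # C(5,3) * 0.25**3 * 0.75**2 ~ 0.0879
print(n * pi)               # E[X]   = n * pi            = 1.25
print(n * pi * (1 - pi))    # Var(X) = n * pi * (1 - pi) = 0.9375
```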
Using the t-test for collocation discovery
The logic of hypothesis testing
The typical scenario in hypothesis testing compares two
hypotheses:
1. The research hypothesis
2. A null hypothesis
The idea is to set up our experiment (study, etc.) in such a way that, if we show the null hypothesis to be false, we can affirm our research hypothesis with a certain degree of confidence.
H0 for collocation studies
There is no real association between w1 and w2, i.e.
occurrence of <w1,w2> is no more likely than chance.
More formally:
H0: P(w1 & w2) = P(w1)P(w2)
i.e. w1 and w2 occur independently
Some more on hypothesis testing
Our research hypothesis (H1):
<w1,w2> are strong collocates
P(w1 & w2) > P(w1)P(w2)
A null hypothesis H0
P(w1 & w2) = P(w1)P(w2)
How do we know whether our results are sufficient to affirm
H1?
I.e. how big is our risk of wrongly rejecting H0?
The notion of significance
We generally fix a “level of confidence” in advance.
In many disciplines, we’re happy with being 95% confident
that the result we obtain is correct.
So we have a 5% chance of error.
Therefore, we state our results at p = 0.05
“The probability of wrongly rejecting H0 is 5% (0.05)”
Tests for significance
Many of the tests we use involve:
1. having a prior notion of what the mean/variance of a population is, according to H0
2. computing the mean/variance on our sample of the population
3. checking whether the sample mean/variance differs from the value predicted by H0, at 95% confidence.
The t-test: strategy
obtain the mean (x̄) and variance (s²) of a sample
H0: sample is drawn from a population with mean μ and
variance σ2
estimate the t value: this compares the sample mean/variance
to the expected (population) mean/variance under H0
check if any difference found is significant enough to reject
H0
Computing t
calculate the difference between the sample mean and the expected population mean
scale the difference by the standard error $\sqrt{s^2/N}$

$t = \frac{\bar{x} - \mu}{\sqrt{s^2 / N}}$
Assumption: population is normally distributed.
If t is big enough, we reject H0. The magnitude of t given our sample size N is
simply looked up in a table.
Tables tell us what the level of significance is (p-value, or likelihood of making a
Type 1 error, wrongly rejecting H0).
Example: new companies
We think of our corpus as a series of bigrams, and each
sample we take is an indicator variable (Bernoulli trial):
value = 1 if a bigram is new companies
value = 0 otherwise
Compute P(new) and P(companies) using standard MLE.
H0: P(new companies) = P(new)P(companies)
Example continued
We have computed the likelihood of our bigram of interest
under H0.
Since this is a Bernoulli trial, this is also our expected mean μ.
We then compute the actual sample probability of <w1,w2>
(new companies).
Compute t and check significance
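Putting the pieces together, a sketch of the t computation for this example. The counts below are illustrative stand-ins of the kind M&S report for new companies, not figures from the slides:

```python
from math import sqrt

N = 14_307_668       # total bigrams in the corpus (illustrative)
c_new = 15_828       # C(new)            (illustrative)
c_companies = 4_675  # C(companies)      (illustrative)
c_bigram = 8         # C(new companies)  (illustrative)

# Expected mean under H0: P(new companies) = P(new) * P(companies)
mu = (c_new / N) * (c_companies / N)

# Sample mean: MLE of P(new companies). For a Bernoulli variable with
# tiny p, the variance s^2 = p * (1 - p) is approximately p itself.
x_bar = c_bigram / N
s2 = x_bar * (1 - x_bar)

t = (x_bar - mu) / sqrt(s2 / N)
print(t)  # ~1.0, well below 1.96 (p = .05, two-tailed): H0 stands
```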
Uses of the t-test
Often used to rank candidate collocations, rather than
compute significance.
Stop word lists must be used, else all bigrams will be
significant.
e.g. M&S report 824 out of 831 bigrams that pass the significance
test.
Reason:
language is just not random
regularities mean that, if the corpus is large enough, the words in any bigram will co-occur regularly and often enough to be significant.
Kilgarriff (2005): Any null hypothesis will be rejected on a large
enough corpus.
Extending the t-test to compare samples
Variation on the original problem:
what co-occurrence relations best distinguish two words, w1 and w1', that are near-synonyms?
e.g. strong vs. powerful
Strategy:
find all bigrams <w1,w2> and <w1',w2>
e.g. strong tea vs. powerful tea
check, for each w2, whether it occurs significantly more often with w1 than with w1'.
NB. This is a two-sample t-test
Two-sample t-test: details
H0: For any w2, the probabilities of <w1,w2> and <w1',w2> are the same.
i.e. μ (the expected difference) = 0
Strategy:
extract samples of <w1,w2> and <w1',w2>
assume they are independent
compute mean and SD for each sample
compute t
check for significance: is the magnitude of the difference large enough?
Formula:
t
x1 x2
2
2
s1 s2
n1 n2
Simplifying under binomial assumptions
On large samples, variance in the binomial distribution
approaches the mean. I.e.:
$\bar{x}_1 = P(w_1, w_2) \approx s_1^2$
(similarly for the other sample mean)
Therefore:

$t \approx \frac{P(w_1, w_2) - P(w_1', w_2)}{\sqrt{\frac{P(w_1, w_2) + P(w_1', w_2)}{N}}}$
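Under these assumptions the two-sample test reduces to a few lines. A sketch with hypothetical counts (two_sample_t is a helper written for this example):

```python
from math import sqrt

def two_sample_t(c_a: int, c_b: int, N: int) -> float:
    """Simplified two-sample t for counts of <w1,w2> vs <w1',w2>
    in a corpus of N bigrams (means and variances are both ~ c/N)."""
    p_a, p_b = c_a / N, c_b / N
    return (p_a - p_b) / sqrt((p_a + p_b) / N)

# Hypothetical: "strong support" seen 50 times, "powerful support" 10 times.
print(two_sample_t(50, 10, 14_307_668))
# The N's cancel: t = (50 - 10) / sqrt(50 + 10) ~ 5.16 > 1.96, significant
```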
Concrete example: strong vs. powerful (M&S, p. 167); NY Times
[Table: words occurring significantly more often with powerful than with strong, and words occurring significantly more often with strong than with powerful.]
Criticisms of the t-test
Assumes that the probabilities are normally distributed. This
is probably not the case in linguistic data, where probabilities
tend to be very large or very small.
Alternative: the chi-squared test (χ²)
compares differences between expected and observed frequencies (e.g. of bigrams)
The chi-square test
Example
Imagine we’re interested in whether poor performance is a good
collocation.
H0: frequency of poor performance is no different from the expected
frequency if each word occurs independently.
Find frequencies of bigrams containing poor, performance and
poor performance.
compare actual to expected frequencies
check if the value is high enough to reject H0
Example continued
OBSERVED FREQUENCIES

                        f(w1 = poor)          f(w1 ≠ poor)
f(w2 = performance)     15                    1,230
                        (poor performance)    (e.g. bad performance)
f(w2 ≠ performance)     3,580                 12,000
                        (poor people)         (all other bigrams)
Expected frequencies need to be computed for each cell:
E.g. expected frequency for cell (1,1), poor performance:
$E_{1,1} = P(w_1 = poor) \times P(w_2 = performance) \times N$
where the marginal probabilities are estimated from the row and column totals.
Computing the value
The chi-squared value is the sum of the squared differences between observed and expected frequencies, each scaled by the expected frequency.
$X^2 = \sum_{i,j} \frac{(O_{i,j} - E_{i,j})^2}{E_{i,j}}$
Value is once again looked up in a table to check if degree of
confidence (p-value) is acceptable.
If so, we conclude that the dependency between w1 and w2 is significant.
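A sketch of the whole computation on the observed table above:

```python
# Chi-squared for the 2x2 table in the "poor performance" example.
O = [[15, 1_230],      # w2 = performance:   [w1 = poor, w1 != poor]
     [3_580, 12_000]]  # w2 != performance

N = sum(sum(row) for row in O)
chi2 = 0.0
for i in range(2):
    for j in range(2):
        row_total = sum(O[i])
        col_total = O[0][j] + O[1][j]
        E = row_total * col_total / N    # expected frequency E_ij
        chi2 += (O[i][j] - E) ** 2 / E

print(chi2)  # ~325, far above 3.84 (p = .05, 1 df): reject independence
```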
More applications of this statistic
Kilgarriff and Rose 1998 use chi-square as a measure of corpus
similarity
draw up an n (rows) × 2 (columns) table
columns correspond to corpora
rows correspond to individual types
compare the difference in counts between corpora
H0: corpora are drawn from the same underlying linguistic
population (e.g. register or variety)
corpora will be highly similar if the ratio of counts for each
word is roughly constant.
This uses lexical variation to compute corpus-similarity.
Limitations of t-test and chi-square
Not easily interpretable
a large chi-square or t value suggests a large difference
but makes more sense as a comparative measure, rather than in
absolute terms
t-test is problematic because of the normality assumption
chi-square doesn’t work very well for small frequencies (by
convention, we don’t calculate it if the expected value for any
of the cells is less than 5)
but n-grams will often be infrequent!
Likelihood ratios for collocation discovery
Rationale
A likelihood ratio is the ratio of two probabilities
indicates how much more likely one hypothesis is compared to
another
Notation:
c1 = C(w1)
c2 = C(w2)
c12 = C(<w1,w2>)
Hypotheses:
H0: P(w2|w1) = p = P(w2|¬w1) (the occurrence of w2 is independent of w1)
H1: P(w2|w1) = p1, P(w2|¬w1) = p2, with p1 ≠ p2 (the occurrence of w2 depends on w1)
Computing the likelihood ratio
Under H0 (independence):
P(w2|w1) = P(w2|¬w1) = p, where $p = \frac{C(w_2)}{N}$

Under H1 (dependence):
$p_1 = \frac{C(w_1, w_2)}{C(w_1)}$, $p_2 = \frac{C(w_2) - C(w_1, w_2)}{N - C(w_1)}$

Likelihoods of the observed counts, treated as binomial:
the probability that c12 of the c1 bigrams starting with w1 are <w1,w2> is b(c12; c1, p) under H0, and b(c12; c1, p1) under H1
the probability that c2 − c12 of the N − c1 bigrams not starting with w1 are <¬w1,w2> is b(c2 − c12; N − c1, p) under H0, and b(c2 − c12; N − c1, p2) under H1
Computing the likelihood ratio
The likelihood of a hypothesis H is L(H):
$L(H_0) = b(c_{12}; c_1, p) \, b(c_2 - c_{12}; N - c_1, p)$
$L(H_1) = b(c_{12}; c_1, p_1) \, b(c_2 - c_{12}; N - c_1, p_2)$
The likelihood ratio is $\lambda = \frac{L(H_0)}{L(H_1)}$
Computing the likelihood ratio
We usually compute the log of the ratio:

$\log \lambda = \log \frac{L(H_0)}{L(H_1)} = \log L(H_0) - \log L(H_1)$

Usually expressed as $-2 \log \lambda$, because for very large samples $-2 \log \lambda$ is approximately χ²-distributed.
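A sketch of the computation in log space for numerical stability (log_binom and neg2_log_lambda are helpers written for this example; the counts are made up):

```python
from math import lgamma, log

def log_binom(k: int, n: int, p: float) -> float:
    """log b(k; n, p), via lgamma; assumes 0 < p < 1."""
    return (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
            + k * log(p) + (n - k) * log(1 - p))

def neg2_log_lambda(c1: int, c2: int, c12: int, N: int) -> float:
    """-2 log(L(H0) / L(H1)) for a bigram <w1,w2>."""
    p = c2 / N
    p1 = c12 / c1
    p2 = (c2 - c12) / (N - c1)
    log_L0 = log_binom(c12, c1, p) + log_binom(c2 - c12, N - c1, p)
    log_L1 = log_binom(c12, c1, p1) + log_binom(c2 - c12, N - c1, p2)
    return -2 * (log_L0 - log_L1)

# Hypothetical counts: c1 = C(w1), c2 = C(w2), c12 = C(<w1,w2>)
print(neg2_log_lambda(c1=2_000, c2=1_500, c12=100, N=14_307_668))
```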
Interpreting the ratio
Suppose that the likelihood ratio for some bigram <w1,w2>
is x. This says:
If we make the hypothesis that w2 is somehow dependent on w1,
then we expect it to occur x times more than its actual base rate of
occurrence would predict.
This ratio is also better for sparse data.
We can use −2 log λ as an approximate chi-square value even when expected frequencies are small.
Concrete example: bigrams involving powerful (M&S, p. 174)
Source: NY Times corpus (N = 14.3m)
[Table of bigrams omitted.]
Note: sparse data can still have a high log-likelihood value! Interpreting −2 log λ as chi-squared allows us to reject H0 even for small samples (e.g. powerful cudgels).
Relative frequency ratios
An extension of the same logic as the likelihood ratio,
used to compare collocations across corpora
Let <w1,w2> be our bigram of interest.
Let C1 and C2 be two corpora:
p1 = P(<w1,w2>) in C1
p2 = P(<w1,w2>) in C2.
r = p1/p2 gives an indication of the relative likelihood of
<w1,w2> in C1 and C2.
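The ratio itself is a one-liner; a sketch with hypothetical counts and corpus sizes (freq_ratio is a helper written for this example):

```python
def freq_ratio(count1: int, size1: int, count2: int, size2: int) -> float:
    """r = p1 / p2, where each p is the bigram's relative frequency
    in its own corpus."""
    return (count1 / size1) / (count2 / size2)

# Hypothetical: 2 occurrences in a 10M-word C1 vs 44 in a 15M-word C2.
print(freq_ratio(2, 10_000_000, 44, 15_000_000))  # ~0.07
```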
Example application
Manning and Schütze (p. 176) compare:
C1: NY Times texts from 1990
C2: NY Times texts from 1989
The bigram <East, Berliners> occurs 44 times in C2, but only 2
times in C1, giving r = 0.03 (since p1 and p2 are relative frequencies, r also reflects the two corpora's sizes).
The big difference is due to 1989 papers dealing more with
the fall of the Berlin Wall.
Summary
We’ve now considered two forms of hypothesis testing:
t-test
chi-square
Also, log-likelihood ratios as measures of relative probability
under different hypotheses.
Next, we begin to look at the problem of lexical acquisition.
References
Church, K. & Hanks, P. (1990). Word association norms, mutual information and lexicography. Computational Linguistics 16(1).
Kilgarriff, A. (2005). Language is never, ever, ever random. Corpus Linguistics and Linguistic Theory 1(2): 263.
Lapata, M., McDonald, S. & Keller, F. (1999). Determinants of Adjective-Noun plausibility. Proceedings of the 9th Conference of the European Chapter of the Association for Computational Linguistics (EACL-99).
Manning, C. D. & Schütze, H. (1999). Foundations of Statistical Natural Language Processing. MIT Press.