Statistics - Personal Webpages (The University of Manchester)


Statistics
M. P. Oakes (1998). Statistics for Corpus Linguistics. Edinburgh University Press.
Rock bottom basics
• Central tendency
– Applies to any set of numerical scores (e.g. frequency counts of word types, lengths of sentences in a corpus)
– mode (the most frequently obtained score)
• Easily affected by chance scores
– median (the score nearest the middle of the range of scores)
• Will be close to the mean if the data are evenly distributed
– mean (the average; written x̄ in equations)
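As a quick illustration (with made-up sentence lengths, not data from the lecture), the three measures can be computed with Python's statistics module:

```python
import statistics

# Hypothetical sentence lengths (in words) from a small corpus sample
lengths = [12, 15, 15, 18, 22, 25, 60]  # 60 is a chance outlier

mode = statistics.mode(lengths)      # most frequently obtained score: 15
median = statistics.median(lengths)  # score nearest the middle: 18
mean = statistics.mean(lengths)      # average (x-bar): about 23.9

# The single outlier pulls the mean well above the median,
# so here the data are not evenly distributed.
```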
Rock bottom basics
• Probability of an event a, usually written P(a)
– For a set of alternative events, the probabilities total 1
– Events are assumed to be independent
• This can be counter-intuitive, but (in a coin toss) the chance of heads is always 1/2, whatever the preceding tosses were
• Probability of an event a, given some other condition b, is written P(a|b)
– Notice that P(a|b) does not depend on P(b): P(skelter|helter) can be high even though helter is itself rare
• Not to be confused with the probability of two events co-occurring
– written P(a, b)
– which is not in general the same as the product P(a) × P(b)
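The distinction between the joint probability P(a, b) and the conditional probability P(a|b) can be made concrete with a toy example; the bigram counts below are invented for illustration:

```python
from fractions import Fraction

# Invented bigram "corpus" of 8 word pairs
bigrams = [("helter", "skelter"), ("helter", "skelter"), ("helter", "swelter"),
           ("hell", "skelter"),
           ("a", "b"), ("a", "b"), ("a", "b"), ("a", "b")]

n = len(bigrams)
p_helter = Fraction(sum(1 for w1, _ in bigrams if w1 == "helter"), n)    # 3/8
p_skelter = Fraction(sum(1 for _, w2 in bigrams if w2 == "skelter"), n)  # 3/8

# Joint probability P(helter, skelter): both events together
p_joint = Fraction(sum(1 for b in bigrams if b == ("helter", "skelter")), n)

# Conditional probability P(skelter | helter)
p_cond = p_joint / p_helter

# The joint probability is not the product of the individual probabilities
assert p_joint != p_helter * p_skelter
```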
Simple word counts
• A simple frequency count on its own might not tell you anything
• You need to compare it with something else
– Frequency counts of other similar things
– Or the frequency count you would expect on average
• Then you need to see whether the measured difference is significant
Statistical significance
• Probably the most commonly used statistic in all of social science is the t-test
• It is understood that any result could be due to random chance
• Statistical significance tells you how likely it is that random chance alone would be responsible for the result you get
• Testing usually involves looking something up in a table, using
– the level of certainty required
– the number of variables, or degrees of freedom
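A minimal sketch of an independent two-sample t-test, with invented per-text frequency counts (the lecture gives no worked example here):

```python
import math
import statistics

# Invented frequency counts of some word in texts from two corpora
sample_a = [4, 7, 6, 5, 8, 6]
sample_b = [3, 4, 5, 3, 4, 5]

na, nb = len(sample_a), len(sample_b)
mean_a, mean_b = statistics.mean(sample_a), statistics.mean(sample_b)
var_a, var_b = statistics.variance(sample_a), statistics.variance(sample_b)

# Pooled variance, assuming equal variances in the two populations
pooled = ((na - 1) * var_a + (nb - 1) * var_b) / (na + nb - 2)
t = (mean_a - mean_b) / math.sqrt(pooled * (1 / na + 1 / nb))
df = na + nb - 2  # degrees of freedom, used to look up the table

print(round(t, 2), df)  # t is about 2.93 with 10 degrees of freedom
```

Here t exceeds the tabulated two-tailed 5% critical value for 10 degrees of freedom (about 2.23), so the difference would be judged significant at that level.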
Correlation
• Frequency counts might provide an ordered list
• You might want to compare counts of two things to see if they are correlated, e.g. word length in English and number of characters in Chinese (Xu 1996)
• Pearson's correlation coefficient:

r = (N Σxy − Σx Σy) / √[(N Σx² − (Σx)²)(N Σy² − (Σy)²)]

• There's also a formula for rank correlation (Spearman's rho)
Xu (1996)
X    Y    X²    Y²    XY
1    2    1     4     2
2    1    4     1     2
2    2    4     4     4
3    1    9     1     3
3    2    9     4     6
4    2    16    4     8
6    2    36    4     12
6    3    36    9     18
7    1    49    1     7
7    2    49    4     14
8    2    64    4     16
9    2    81    4     18
10   2    100   4     20
11   2    121   4     22
11   3    121   9     33
TOTAL (N = 15): ΣX = 90, ΣY = 29, ΣX² = 700, ΣY² = 61, ΣXY = 185

r = (15 × 185 − 90 × 29) / √[(15 × 700 − 90 × 90)(15 × 61 − 29 × 29)]
  = 165 / √(2400 × 74)
  ≈ 0.39

The critical value for 15 pairs of observations at the 5% level of confidence is 0.441, so the result is not statistically significant (it is at the 10% level though).
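The hand calculation for Xu's data can be checked in Python with the same formula:

```python
import math

# X and Y pairs from the Xu (1996) table
X = [1, 2, 2, 3, 3, 4, 6, 6, 7, 7, 8, 9, 10, 11, 11]
Y = [2, 1, 2, 1, 2, 2, 2, 3, 1, 2, 2, 2, 2, 2, 3]

N = len(X)
sum_x, sum_y = sum(X), sum(Y)
sum_xx = sum(x * x for x in X)             # 700
sum_yy = sum(y * y for y in Y)             # 61
sum_xy = sum(x * y for x, y in zip(X, Y))  # 185

r = (N * sum_xy - sum_x * sum_y) / math.sqrt(
    (N * sum_xx - sum_x ** 2) * (N * sum_yy - sum_y ** 2))
print(round(r, 2))  # 0.39
```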
Comparison with expected values
• We might want to compare the relative frequencies of a range of features
• The chi-square test shows whether frequency differences are significant:

χ² = Σ (O − E)² / E

• where O is the observed value and E is the expected value:

E = (row total × column total) / grand total
Yamamoto (1996)
• Frequencies of types of third-person reference in English and Japanese

                      Japanese  English  TOTAL    E(J)    E(E)   χ²(J)  χ²(E)
Ellipsis                   104        0    104   48.60   55.40   63.14  55.40
Central pronouns            73      314    387  180.86  206.14   64.32  56.43
Non-central pronouns        12       28     40   18.69   21.31    2.40   2.10
Names                      314      291    605  282.73  322.27    3.46   3.03
Common NPs                 205      174    379  177.12  201.88    4.39   3.85
TOTAL                      708      807   1515

• Sum of the χ² values ≈ 258.5, significant for (5 − 1) × (2 − 1) = 4 degrees of freedom at the 0.1% level
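The chi-square sum for Yamamoto's table can be reproduced directly from the observed counts:

```python
# Observed counts from the Yamamoto (1996) table: (Japanese, English)
observed = [
    (104, 0),    # ellipsis
    (73, 314),   # central pronouns
    (12, 28),    # non-central pronouns
    (314, 291),  # names
    (205, 174),  # common NPs
]

row_totals = [j + e for j, e in observed]          # one total per category
col_totals = [sum(col) for col in zip(*observed)]  # [708, 807]
grand = sum(row_totals)                            # 1515

chi2 = 0.0
for row, row_total in zip(observed, row_totals):
    for o, col_total in zip(row, col_totals):
        e = row_total * col_total / grand  # expected value for this cell
        chi2 += (o - e) ** 2 / e

print(round(chi2, 1))  # 258.5
```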
Co-occurrence
• Is the distribution of two things correlated?
• Contingency table, e.g. for sentences where two words co-occur or not:

          W1   not W1
W2         a        b
not W2     c        d

• Phi coefficient:

φ² = χ² / (a + b + c + d)

• Dice's coefficient:

s = 2a / (2a + b + c)

• Several variants exist
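With invented cell counts, the two coefficients (in their standard forms) can be computed as:

```python
import math

# Invented contingency counts: a = sentences containing both W1 and W2,
# b = W2 only, c = W1 only, d = neither
a, b, c, d = 20, 10, 5, 65

# Phi coefficient (phi squared times N gives the chi-square value)
phi = (a * d - b * c) / math.sqrt((a + b) * (c + d) * (a + c) * (b + d))

# Dice's coefficient: twice the joint count over the two marginal totals
dice = 2 * a / (2 * a + b + c)

print(round(phi, 2), round(dice, 2))  # 0.63 0.73
```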
Co-occurrence
• Scores such as Dice's coefficient need to be turned into something like a t score, so that significance can be measured:

t = (f(x, y) − f(x) f(y) / N) / √f(x, y)

• In terms of the contingency matrix (f(x, y) = a, f(x) = a + c, f(y) = a + b, N = a + b + c + d):

t = (a − (a + c)(a + b) / (a + b + c + d)) / √a
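A sketch of the t score for a word pair, with invented corpus frequencies:

```python
import math

# Invented frequencies: f_x and f_y are each word's corpus frequency,
# f_xy their co-occurrence frequency, N the number of word pairs
N = 100_000
f_x, f_y, f_xy = 500, 800, 30

expected = f_x * f_y / N  # co-occurrences expected by chance
t = (f_xy - expected) / math.sqrt(f_xy)

print(round(t, 2))  # 4.75
```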
Co-occurrence
• Mutual information measures the relatedness of two variables
• It compares the joint probability with the product of the individual probabilities:

I(x; y) = log₂ [P(x, y) / (P(x) P(y))]

• I ≈ 0: chance association
• I >> 0: strong association
• I << 0: complementary distribution
• In terms of the contingency matrix:

P(x) = (a + c) / (a + b + c + d)
P(y) = (a + b) / (a + b + c + d)
P(x, y) = a / (a + b + c + d)
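Mutual information for a word pair, again with invented corpus frequencies:

```python
import math

# Invented frequencies: f_x and f_y are the words' corpus frequencies,
# f_xy their co-occurrence frequency, N the number of word pairs
N = 100_000
f_x, f_y, f_xy = 500, 800, 30

# I(x; y) = log2 of the joint probability over the product of the
# individual probabilities
mi = math.log2((f_xy / N) / ((f_x / N) * (f_y / N)))

print(round(mi, 2))  # 2.91
```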
Church & Hanks (1990)
• Used MI to show word associations
– e.g. doctors + {dentists, nurses, treating, treat, examine, bills, hospitals}
– In contrast with doctors + {with, a, is}
– Identify phrasal verbs, e.g. set + {up, off, out, in} but not set + about
– Using a parser to separate noun and verb readings, find the most likely objects of the verb drink
– What you can do to a telephone (sit by, disconnect, answer, …)
Church et al. (1991)
• strong vs powerful experiment

MI      word pair           MI     word pair
10.47   strong northerly    8.66   powerful legacy
9.76    strong showings     8.58   powerful tool
9.30    strong believer     8.35   powerful storms