
Simple Statistics for
Corpus Linguistics
Sean Wallis
Survey of English Usage
University College London
[email protected]
Outline
• Numbers…
• A simple research question
  – do women speak or write more than men in ICE-GB?
  – p = proportion = probability
• Another research question
  – what happens to speakers’ use of modal shall vs. will over time?
  – the idea of inferential statistics
  – plotting confidence intervals
• Concluding remarks
Numbers...
• We are used to concepts like these being expressed as numbers:
  – length (distance, height)
  – area
  – volume
  – temperature
  – wealth (income, assets)
• We are going to discuss another concept:
  – probability
    • proportion, percentage
  – a simple idea, at the heart of statistics
Probability
• Based on another, even simpler, idea:
  – probability p = x / n
    • e.g. the probability that the speaker says will instead of shall
• where
  – frequency x (often, f ) – cases of will
    • the number of times something actually happens
    • the number of hits in a search
  – baseline n – total: will + shall
    • the number of times something could happen
    • the number of hits
      – in a more general search
      – in several alternative patterns (‘alternate forms’)
• Probability can range from 0 to 1
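The definition above can be sketched in a few lines of Python; the frequencies here are invented purely for illustration:

```python
# Probability as a proportion: p = x / n.
# Invented frequencies: x cases of "will" against a baseline of will + shall.
x_will = 80    # hypothetical hits for will
x_shall = 20   # hypothetical hits for shall

n = x_will + x_shall   # baseline: every case where either form could occur
p = x_will / n         # probability that the speaker chooses will

assert 0.0 <= p <= 1.0  # a probability always lies in the range 0 to 1
print(p)                # 0.8
```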
What can a corpus tell us?
• A corpus is a source of knowledge about language:
  – corpus
  – introspection/observation/elicitation
  – controlled laboratory experiment
  – computer simulation
  } How do these differ in what they might tell us?
• A corpus is a sample of language, varying by:
  – source (e.g. speech vs. writing, age...)
  – levels of annotation (e.g. parsing)
  – size (number of words)
  – sampling method (random sample?)
  } How does this affect the types of knowledge we might obtain?
What can a parsed corpus tell us?
• Three kinds of evidence may be found in a parsed corpus:
  1. Frequency evidence of a particular known rule, structure or linguistic event – How often?
  2. Factual evidence of new rules, etc. – How novel?
  3. Interaction evidence of relationships between rules, structures and events – Does X affect Y?
• Lexical searches may also be made more precise using the grammatical analysis
A simple research question
• Let us consider the following question:
• Do women speak or write more words
than men in the ICE-GB corpus?
• What do you think?
• How might we find out?
Let’s get some data
• Open ICE-GB with ICECUP
  – Text Fragment query for words:
    • “*+<{~PUNC,~PAUSE}>”
    • counts every word, excluding pauses and punctuation
  – Variable query:
    • TEXT CATEGORY = spoken, written
  – Variable query:
    • SPEAKER GENDER = f, m, <unknown>
  } combine these 3 queries
  – Word counts by gender (ICE-GB):

                  F         M   <unknown>      TOTAL
    spoken    174,499   439,741     1,076    615,316
    written   101,500   228,193    92,279    421,972
    TOTAL     275,999   667,934    93,355  1,037,288
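As a sketch, the proportions discussed on the next slide can be recomputed directly from this table (figures taken from the table above):

```python
# Word counts from the ICE-GB table above.
counts = {
    "spoken":  {"f": 174_499, "m": 439_741, "unknown": 1_076},
    "written": {"f": 101_500, "m": 228_193, "unknown": 92_279},
}

def p_female(row):
    """Proportion of words produced by women, excluding <unknown>."""
    return row["f"] / (row["f"] + row["m"])

for category, row in counts.items():
    print(category, round(p_female(row), 3))
# spoken 0.284, written 0.308 - the female/male ratio varies slightly
```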
ICE-GB: gender / written-spoken
• Proportion of words in each category spoken/written by women and men
  – The authors of some texts are unspecified
  – Some written material may be jointly authored
[Bar chart: p(female) for written, spoken and TOTAL words, on a scale p1 from 0 to 0.8]
• p(female) = words spoken by women / total words (excluding <unknown>)
  – female/male ratio varies slightly
</gr-replace>
p = Probability = Proportion
• We asked ourselves the following question:
  – Do women speak or write more words than men in the ICE-GB corpus?
  – To answer this we looked at the proportion of words in ICE-GB that are produced by women (out of all words where the gender is known)
• The proportion of words produced by women can also be thought of as a probability:
  – What is the probability that, if we were to pick any random word in ICE-GB (where the gender was known), it would be uttered by a woman?
Another research question
• Let us consider the following question:
• What happens to modal shall vs. will
over time in British English?
– Does shall increase or decrease?
• What do you think?
• How might we find out?
Let’s get some data
• Open DCPSE with ICECUP
  – FTF query for first person declarative shall:
    • repeat for will
  – Corpus Map:
    • DATE
  } Do the first set of queries and then drop into Corpus Map
Modal shall vs. will over time
• Plotting probability of speaker selecting modal shall out of shall/will over time (DCPSE)
[Line graph: p(shall | {shall, will}) by year, 1955-1995, from 0.0 (shall = 0%) to 1.0 (shall = 100%)]
• Is shall going up or down?
(Aarts et al. 2013)
Is shall going up or down?
• Whenever we look at change, we must ask ourselves two things:
  1. What is the change relative to?
     – Is our observation higher or lower than we might expect?
     – In this case we ask: does shall decrease relative to shall + will?
  2. How confident are we in our results?
     – Is the change big enough to be reproducible?
The idea of a confidence interval
• All observations are imprecise
  – Randomness is a fact of life
  – Our abilities are finite:
    • to measure accurately or
    • reliably classify into types
• We need to express caution in citing numbers
• Example (from Levin 2013):
  – 77.27% of uses of think in 1920s data have a literal (‘cogitate’) meaning
    • Really? Not 77.28, or 77.26?
  – 77% of uses of think in 1920s data have a literal (‘cogitate’) meaning
    • Sounds defensible. But how confident can we be in this number?
  – 77% (66-86%*) of uses of think in 1920s data have a literal (‘cogitate’) meaning
    • Finally we have a credible range of values – but it needs a footnote* to explain how it was calculated.
The ‘sample’ and the ‘population’
• We said that the corpus was a sample
• Previously, we asked about the proportions of male/female words in the corpus (ICE-GB)
  – We asked questions about the sample
  – The answers were statements of fact
• Now we are asking about “British English”
  – We want to draw an inference
    • from the sample (in this case, DCPSE)
    • to the population (similarly-sampled BrE utterances)
  – This inference is a best guess
  – This process is called inferential statistics
Basic inferential statistics
• Suppose we carry out an experiment
  – We toss a coin 10 times and get 5 heads
  – How confident are we in the results?
• Suppose we repeat the experiment
  – Will we get the same result again?
• Let’s try…
  – You should have one coin
  – Toss it 10 times
  – Write down how many heads you get
  – Do you all get the same results?
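This classroom exercise is easy to simulate; a minimal sketch in Python (the seed and number of repetitions are arbitrary choices):

```python
import random
from collections import Counter

random.seed(1)  # arbitrary seed, for repeatability

def heads_in_ten_tosses():
    """Toss a fair coin 10 times; return the number of heads."""
    return sum(random.random() < 0.5 for _ in range(10))

# 1000 "students" each repeat the experiment.
results = [heads_in_ten_tosses() for _ in range(1000)]
print(Counter(results))  # scores cluster around 5 heads, but vary by chance
```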
The Binomial distribution
• Repeated sampling tends to form a Binomial distribution around the expected mean X
  – We toss a coin 10 times, and get 5 heads
  – Due to chance, some samples will have a higher or lower score
[Histogram: frequency F of each number of heads x (1 to 9), shown for a growing number of samples N = 1, 4, 8, 12, 16, 20, 26, clustering around the expected mean X]
The Binomial distribution
• It is helpful to express x as the probability of choosing a head, p, with expected mean P
  – p = x / n
  – n = max. number of possible heads (10)
• Probabilities are in the range 0 to 1 = percentages (0 to 100%)
[Histogram: frequency F against p from 0.1 to 0.9, centred on P]
The Binomial distribution
• Take-home point:
  – A single observation, say x hits (or p as a proportion of n possible hits) in the corpus, is not guaranteed to be correct ‘in the world’!
• Estimating the confidence you have in your results is essential
  – We want to make predictions about future runs of the same experiment
[Histogram: an observed p against the distribution of frequency F around the expected mean P]
Binomial → Normal
• The Binomial (discrete) distribution is close to the Normal (continuous) distribution
[Chart: discrete Binomial bars overlaid with a continuous Normal curve, x from 0.1 to 0.9]
The central limit theorem
• Any Normal distribution can be defined by only two variables and the Normal function z
  – population mean P
  – standard deviation S = √( P(1 – P) / n )
  – With more data in the experiment, S will be smaller
• 95% of the curve is within ~2 standard deviations of the expected mean
  – the correct figure is 1.95996! = the critical value of z for an error level of 0.05.
[Normal curve: 95% of the area lies within z·S of P, with 2.5% in each tail; p runs from 0.1 to 0.7]
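The interval described above is straightforward to compute; a minimal sketch, using an invented coin-tossing example rather than corpus data:

```python
import math

Z = 1.95996  # critical value of z for an error level of 0.05

def normal_interval(P, n, z=Z):
    """95% Normal interval P +/- z.S, with S = sqrt(P(1 - P) / n)."""
    S = math.sqrt(P * (1 - P) / n)
    return P - z * S, P + z * S

# Invented example: expected mean P = 0.5 over n = 10 coin tosses.
lo, hi = normal_interval(0.5, 10)
print(round(lo, 2), round(hi, 2))   # 0.19 0.81 - with more data, S shrinks
```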
The single-sample z test...
• Is an observation p more than z standard deviations from the expected (population) mean P?
• If yes, p is significantly different from P
[Normal curve about P with 2.5% tails beyond ±z·S; the observation p falls outside the interval]
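A sketch of that test in Python; the observation (70 heads in 100 tosses) is invented for illustration:

```python
import math

def z_test(p, P, n, z=1.95996):
    """Single-sample z test: is p more than z standard deviations from P?"""
    S = math.sqrt(P * (1 - P) / n)
    return abs(p - P) > z * S   # True => significantly different from P

# Invented example: 70 heads in 100 tosses against an expected P of 0.5.
print(z_test(70 / 100, 0.5, 100))   # True
```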
...gives us a “confidence interval”
• P ± z·S is the confidence interval for P
  – We want to plot the interval about p
• The interval about p is called the Wilson score interval
• This interval reflects the Normal interval about P:
  – If P is at the upper limit of p, p is at the lower limit of P
[Normal curve about P with 2.5% tails; the Wilson interval (w–, w+) is plotted about the observation p]
(Wallis, 2013)
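A sketch of the Wilson score interval in Python, following Wilson (1927); the observed p = 0.5, n = 10 example is invented:

```python
import math

def wilson_interval(p, n, z=1.95996):
    """Wilson score interval (w-, w+) about an observed proportion p = x/n."""
    denom = 1 + z * z / n
    centre = p + z * z / (2 * n)
    spread = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (centre - spread) / denom, (centre + spread) / denom

# Invented example: 5 heads out of 10 tosses.
w_minus, w_plus = wilson_interval(5 / 10, 10)
print(round(w_minus, 3), round(w_plus, 3))   # 0.237 0.763
```

Unlike the Normal interval about P, this interval is asymmetric about p in general and never strays outside the range 0 to 1.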
Modal shall vs. will over time
• Simple test:
  – Compare p for
    • all LLC texts in DCPSE (1956-77) with
    • all ICE-GB texts (early 1990s)
  – We get the following data (may be input in a 2 × 2 chi-square test)

             LLC   ICE-GB   total
    shall    110       40     150
    will      78       58     136
    total    188       98     286

  – We may plot the probability of shall being selected, with Wilson intervals
  – or you can check Wilson intervals
[Bar chart: p(shall | {shall, will}) with Wilson intervals for LLC and ICE-GB, scale 0.0 to 1.0]
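The 2 × 2 chi-square mentioned above can be computed directly from this table; a minimal sketch:

```python
# 2 x 2 chi-square test on the shall/will table above.
table = [[110, 40],   # shall: LLC, ICE-GB
         [78,  58]]   # will:  LLC, ICE-GB

row_totals = [sum(row) for row in table]          # 150, 136
col_totals = [sum(col) for col in zip(*table)]    # 188, 98
grand = sum(row_totals)                           # 286

chi2 = sum((obs - row_totals[i] * col_totals[j] / grand) ** 2
           / (row_totals[i] * col_totals[j] / grand)
           for i, row in enumerate(table)
           for j, obs in enumerate(row))

print(round(chi2, 2))   # 8.09
print(chi2 > 3.841)     # True: significant at the 0.05 level (1 d.f.)
```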
Modal shall vs. will over time
• Plotting modal shall/will over time (DCPSE)
[Line graph: p(shall | {shall, will}) by year, 1955-1995, plotted with confidence intervals, scale 0.0 to 1.0]
• Small amounts of data / year
• Confidence intervals identify the degree of certainty in our results
• Highly skewed p in some cases
  – p = 0 or 1 (circled)
• We can now estimate an approximate downwards curve
(Aarts et al. 2013)
Recap
• Whenever we look at change, we must ask ourselves two things:
  1. What is the change relative to?
     – Is our observation higher or lower than we might expect?
     – In this case we ask: does shall decrease relative to shall + will?
  2. How confident are we in our results?
     – Is the change big enough to be reproducible?
Conclusions
• An observation is not the actual value
  – Repeating the experiment might get different results
• The basic idea of these methods is
  – Predict the range of future results if the experiment were repeated
  – ‘Significant’ = effect > 0 (e.g. 19 times out of 20)
• Based on the Binomial distribution
  – Approximated by the Normal distribution – many uses
    • Plotting confidence intervals
    • Use goodness of fit or single-sample z tests to compare an observation with an expected baseline
    • Use 2 × 2 tests or two independent-sample z tests to compare two observed samples
References
• Aarts, B., J. Close, G. Leech and S.A. Wallis (eds.) 2013. The Verb Phrase in English: Investigating recent language change with corpora. Cambridge: CUP.
  – Aarts, B., Close, J., and Wallis, S.A. 2013. Choices over time: methodological issues in investigating current change. Chapter 2.
  – Levin, M. 2013. The progressive in modern American English. Chapter 8.
• Wallis, S.A. 2013. Binomial confidence intervals and contingency tests. Journal of Quantitative Linguistics 20:3, 178-208.
• Wilson, E.B. 1927. Probable inference, the law of succession, and statistical inference. Journal of the American Statistical Association 22: 209-212.
• NOTE: Statistics papers, more explanation, spreadsheets etc. are published on the corp.ling.stats blog: http://corplingstats.wordpress.com