Capturing linguistic interaction in a grammar

MA in English Linguistics
Experimental design and statistics
Sean Wallis
Survey of English Usage
University College London
Outline
• What is a research question?
• Choice and baselines
• Making sense of probability
• Observing change in a corpus
• Drawing inferences to larger populations
• Estimating error in observations
• Testing results for significance
What is a research question?
• You may have heard this phrase last term
• What do you think we mean by
a “research question”?
• Can you think of any examples?
Examples
• Some example research questions
– smoking is good for you
– dropped objects accelerate toward the ground at
9.8 metres per second squared
– ’s is a clitic rather than a word
– the word shall is used less often in recent years
– the degree of preference for shall rather
than will has declined in British English
over the period 1960s-1990s
Testable hypotheses
• An hypothesis = a testable research question
• Compare
– the word shall is used less in recent years
to
– the degree of preference for shall rather
than will has declined in British English over
the period 1960s-1990s
• How could you test these hypotheses?
Questions of choice
• Suppose we wanted to test the following
hypothesis using DCPSE
– the word shall is used less in recent years
• When we say the word shall is used less...
– ...less compared to what?
• Traditionally, corpus linguists have “normalised” data as
a proportion of words (so we might say shall is used
less frequently per million words)
• But what might this mean?
Questions of choice
• From the speaker’s perspective:
– The probability of a speaker using a word like shall depends
on whether they had the opportunity to say it in the
first place
– They were about to say will, but said shall instead
– Per million words might still be relevant from the hearer’s
perspective
• If we can identify all points where the choice
arose, we have an ideal baseline for studying
linguistic choices made by speakers/writers.
– Can all cases of will be replaced by shall ?
– What about second or third person shall ?
Baselines
• The baseline is a central element of the
hypothesis
– Changes are always relative to something
– You can get different results with different baselines
– Different baselines imply different conclusions
• We have seen two different kinds of baselines
– A word baseline
• shall per million words
– A choice baseline (an “alternation experiment”)
• shall as a proportion of the choice shall vs. will (including ’ll),
when the choice arises
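The difference between the two kinds of baseline can be sketched in a few lines of Python. The counts below are invented purely for illustration (they are not DCPSE figures), and the function names are hypothetical.

```python
def per_million_words(hits, corpus_words):
    """Word baseline: frequency normalised per million words."""
    return hits / corpus_words * 1_000_000


def choice_proportion(hits, alternatives):
    """Choice baseline: hits as a proportion of all points where the choice arose."""
    return hits / (hits + alternatives)


# Invented counts for two hypothetical samples (NOT real corpus data).
earlier = {"words": 400_000, "shall": 200, "will": 800}
later = {"words": 400_000, "shall": 150, "will": 300}

for name, s in (("earlier", earlier), ("later", later)):
    pmw = per_million_words(s["shall"], s["words"])
    p = choice_proportion(s["shall"], s["will"])
    print(f"{name}: shall per million words = {pmw:.0f}, p(shall | shall, will) = {p:.2f}")
```

With these invented numbers shall falls per million words but rises as a proportion of shall + will, which is the sense in which different baselines can imply different conclusions.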
Baselines
• In many cases it is very difficult to identify all cases
where “the choice” arises
– e.g. studying modal verbs
• You may need to pick a different baseline
– Be as specific as you can
• words → VPs → tensed VPs → alternating modals
– alternation = “different words, same meaning”
• Other hypotheses imply different baselines:
– Different meanings of the same word (semasiological variation):
• e.g. uses of very, as a proportion of all cases of very
– very + N: the very person
– very + ADJ: the very tall person
– very + ADV: very slightly moving
Probability
• We are used to concepts like these being
expressed as numbers:
– length (distance, height)
– area
– volume
– temperature
– wealth (income, assets)
• We are going to discuss another concept:
– probability (proportion, percentage)
Probability
• Based on another, even simpler, idea:
– probability p = x / n
– e.g. the probability that the
speaker says will instead of shall
• where
– frequency x (often, f ) is
• the number of times something actually happens
• the number of hits in a search
• e.g. cases of will
– baseline n is
• the number of times something could happen
• the number of hits
– in a more general search
– in several alternative patterns (‘alternate forms’)
• e.g. total: will + shall
• Probability can range from 0 to 1
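As a minimal sketch, the definition can be written as a small Python function. The counts are the LLC figures (78 cases of will, 110 of shall) from the data table later in these slides; the function name `probability` is only illustrative.

```python
def probability(x, n):
    """Return p = x / n, where x is the frequency and n the baseline."""
    if n <= 0:
        raise ValueError("baseline n must be positive")
    if not 0 <= x <= n:
        raise ValueError("frequency x must lie between 0 and n")
    return x / n


will_hits = 78    # x: cases of will (LLC)
shall_hits = 110  # cases of the alternative, shall (LLC)
n = will_hits + shall_hits  # baseline: total will + shall

p_will = probability(will_hits, n)
print(f"p(will | will, shall) = {will_hits}/{n} = {p_will:.3f}")
```

Because x can never exceed n, the result always lies between 0 and 1.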
A simple research question
• What happens to modal shall vs. will
over time in British English?
– Does shall increase or decrease?
• What do you think?
• How might we find out?
Let’s get some data
• Open DCPSE with ICECUP
– FTF query for first person declarative shall
• repeat for will
– Corpus Map:
• DATE
– Do the first set of queries and then drop into Corpus Map
Modal shall vs. will over time
• Plotting probability of speaker selecting modal
shall out of shall/will over time (DCPSE)
[Figure: p(shall | {shall, will}), from 0.0 (shall = 0%) to 1.0 (shall = 100%), plotted against year, 1955 to 1995 (Aarts et al., 2013)]
• Is shall going up or down?
Is shall going up or down?
• Whenever we look at change, we must ask
ourselves two things:
What is the change relative to?
– What is our baseline for comparison?
– In this case we ask
• Does shall decrease relative to shall + will?
How confident are we in our results?
– Is the change big enough to be reproducible?
The ‘sample’ and the ‘population’
• The corpus is a sample
• If we ask questions about the proportions of
certain words in the corpus
– We ask questions about the sample
– Answers are statements of fact
• Now we are asking about “British English”
– We want to draw an inference
• from the sample (in this case, DCPSE)
• to the population (similarly-sampled BrE utterances)
– This inference is a best guess
– This process is called inferential statistics
Basic inferential statistics
• Suppose we carry out an experiment
– We toss a coin 10 times and get 5 heads
– How confident are we in the results?
• Suppose we repeat the experiment
• Will we get the same result again?
• Let’s try…
– You should have one coin
– Toss it 10 times
– Write down how many heads you get
– Do you all get the same results?
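If no coins are to hand, the classroom exercise can be imitated with a short simulation. This is only a sketch: the class size of 26 and the seed are arbitrary assumptions, and `count_heads` is a hypothetical helper name.

```python
import random


def count_heads(tosses=10, rng=random):
    """Toss a fair coin `tosses` times and return the number of heads."""
    return sum(rng.random() < 0.5 for _ in range(tosses))


rng = random.Random(1)  # fixed seed so the run is repeatable; any seed would do
class_results = [count_heads(10, rng) for _ in range(26)]  # e.g. 26 students
print(class_results)
```

Each entry is one student's number of heads out of 10. The entries will typically differ from one another, which is the point of the exercise.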
The Binomial distribution
• Repeated sampling tends to form a Binomial
distribution around the expected mean X
• We toss a coin 10 times, and get 5 heads (N = 1)
• Due to chance, some samples will have a
higher or lower score
[Figure: histogram of frequency F against number of heads x (1 to 9), centred on X, built up over N = 1, 4, 8, 12, 16, 20 and 26 repetitions]
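The shape that the repeated samples tend towards can be computed directly. The sketch below evaluates the Binomial distribution for 10 tosses of a fair coin and prints a rough text histogram; the function name `binomial_pmf` and the scaling of the bars are illustrative choices.

```python
from math import comb


def binomial_pmf(x, n, P=0.5):
    """Probability of exactly x heads in n tosses when each toss has probability P."""
    return comb(n, x) * P**x * (1 - P) ** (n - x)


n = 10
for x in range(n + 1):
    prob = binomial_pmf(x, n)
    # One '#' per percentage point, rounded.
    print(f"x = {x:2d}  {prob:.4f}  {'#' * round(prob * 100)}")
```

The bars peak at x = 5 and fall away symmetrically on either side, matching the histogram built up from repeated tosses.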
The Binomial distribution
• It is helpful to express x as the probability of
choosing a head, p, with expected mean P
• p = x / n
– n = max. number of possible heads (10)
• Probabilities are in the range 0 to 1
= percentages (0 to 100%)
[Figure: the same distribution replotted as frequency F against p (0.1 to 0.9), centred on P]
The Binomial distribution
• Take-home point:
– A single observation, say x hits (or p as a
proportion of n possible hits) in the corpus, is
not guaranteed to be correct ‘in the world’!
• Estimating the confidence you have in your results
is essential
– We want to make predictions about future runs of the
same experiment
[Figure: frequency F against p (0.1 to 0.9), showing a single observation p that differs from the expected mean P]
Binomial → Normal
• The Binomial (discrete) distribution is close to
the Normal (continuous) distribution
[Figure: Binomial histogram with a Normal curve overlaid, F against x (0.1 to 0.9)]
Binomial → Normal
• Any Normal distribution can be defined by
only two variables and the Normal function z
– population mean P
– standard deviation $S = \sqrt{P(1 - P) / n}$
• With more data in the experiment, S will be smaller
• 95% of the curve is within ~2 standard
deviations of the expected mean
– the correct figure is 1.95996!
= the critical value of z for an error level of 0.05.
[Figure: Normal curve of F against p centred on P, with the central 95% lying within z.S either side of P and 2.5% in each tail]
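Both quantities can be checked with the Python standard library. In this sketch P = 0.5 and the sample sizes are arbitrary illustrative values; `standard_deviation` is a hypothetical helper name.

```python
from math import sqrt
from statistics import NormalDist


def standard_deviation(P, n):
    """Standard deviation of the Normal approximation to the Binomial: sqrt(P(1 - P) / n)."""
    return sqrt(P * (1 - P) / n)


# Two-tailed critical value of z for an error level of 0.05 (2.5% in each tail).
z = NormalDist().inv_cdf(1 - 0.05 / 2)
print(f"z = {z:.5f}")

P = 0.5
for n in (10, 100, 1000):
    S = standard_deviation(P, n)
    print(f"n = {n:4d}  S = {S:.4f}  interval = ({P - z * S:.3f}, {P + z * S:.3f})")
```

As n grows, S shrinks and the interval P ± z.S narrows around P.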
The single-sample z test...
• Is an observation p more than z standard deviations
from the expected (population) mean P?
• If yes, p is significantly different from P
[Figure: Normal curve of F against p centred on P, with 2.5% tails beyond z.S on either side and an observation p marked]
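A minimal sketch of this test, applied to the coin example (expected P = 0.5, n = 10 tosses). The function name is hypothetical, and with only 10 tosses the Normal approximation is rough, so the output is illustrative rather than a recommended procedure for such small samples.

```python
from math import sqrt
from statistics import NormalDist


def single_sample_z_test(p, P, n, alpha=0.05):
    """Return (z score, significant?) for observation p against expected mean P."""
    S = sqrt(P * (1 - P) / n)                     # standard deviation about P
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # two-tailed critical value
    z_score = (p - P) / S
    return z_score, abs(z_score) > z_crit


# Coin example: expected P = 0.5, n = 10 tosses.
for heads in (5, 8, 9):
    z_score, significant = single_sample_z_test(heads / 10, 0.5, 10)
    print(f"{heads} heads: z = {z_score:+.2f}, significant = {significant}")
```

Under these assumptions 8 heads falls inside the interval about P, whereas 9 heads falls outside it.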
...gives us a “confidence interval”
• P ± z.S is the confidence interval for P
– We want to plot the interval about p
• The interval about p is called the
Wilson score interval (Wallis, 2013)
• This interval reflects the Normal interval about P:
– If P is at the upper limit of p,
p is at the lower limit of P
[Figure: Normal curve of F against p centred on P, with 2.5% tails; the observation p is shown with its Wilson interval bounds w– and w+]
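The Wilson score interval has a standard closed form, sketched below. The function name `wilson_interval` and the coin-based examples (0, 3 and 5 heads out of 10) are illustrative assumptions, not taken from the slides.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(p, n, alpha=0.05):
    """Return the Wilson score interval (w-, w+) about an observed proportion p."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    centre = p + z * z / (2 * n)
    spread = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    denom = 1 + z * z / n
    return (centre - spread) / denom, (centre + spread) / denom


for heads in (0, 3, 5):
    p = heads / 10
    lower, upper = wilson_interval(p, 10)
    print(f"p = {p:.1f}: w- = {lower:.3f}, w+ = {upper:.3f}")
```

The interval is asymmetric about p except at p = 0.5, and it stays within the range 0 to 1 even when p = 0. Setting P = w+ and subtracting z.S for that P gives back p, which is the relationship between the two intervals described above.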
Modal shall vs. will over time
• Simple test:
– Compare p for
• all LLC texts in DCPSE (1956-77) with
• all ICE-GB texts (early 1990s)
– We get the following data

          shall   will   total
LLC         110     78     188
ICE-GB       40     58      98
total       150    136     286

– May be input in a 2 x 2 chi-square test
- or you can check Wilson intervals
– We may plot the probability of shall being selected,
with Wilson intervals
[Figure: p(shall | {shall, will}) for LLC and ICE-GB on a 0.0 to 1.0 axis, with Wilson intervals]
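The 2 x 2 chi-square calculation for this table can be sketched as follows. This is a plain Pearson chi-square with no continuity correction, which is an assumption on my part rather than something the slides specify; the variable names are illustrative.

```python
from statistics import NormalDist

# Observed frequencies from the table above.
observed = {
    "LLC": {"shall": 110, "will": 78},
    "ICE-GB": {"shall": 40, "will": 58},
}

row_totals = {row: sum(cells.values()) for row, cells in observed.items()}
col_totals = {
    col: sum(observed[row][col] for row in observed) for col in ("shall", "will")
}
grand_total = sum(row_totals.values())

# Chi-square: sum of (observed - expected)^2 / expected over the four cells.
chi_square = 0.0
for row, cells in observed.items():
    for col, o in cells.items():
        e = row_totals[row] * col_totals[col] / grand_total
        chi_square += (o - e) ** 2 / e

# With 1 degree of freedom the critical value equals z squared.
critical = NormalDist().inv_cdf(0.975) ** 2

print(f"chi-square = {chi_square:.2f}, critical value = {critical:.2f}")
print("significant" if chi_square > critical else "not significant")
```

The expected value for each cell is its row total multiplied by its column total, divided by the grand total of 286.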
Modal shall vs. will over time
• Plotting modal shall/will over time (DCPSE)
• Small amounts of data / year
• Confidence intervals identify the degree of
certainty in our results
• Highly skewed p in some cases
– p = 0 or 1 (circled)
• We can now estimate an approximate
downwards curve
[Figure: p(shall | {shall, will}) by year, 1955 to 1995, on a 0.0 to 1.0 axis, with confidence intervals (Aarts et al., 2013)]
Recap
• Whenever we look at change, we must ask
ourselves two things:
What is the change relative to?
– Is our observation higher or lower than we might
expect
• In this case we ask
• Does shall decrease relative to shall + will?
How confident are we in our results?
– Is the change big enough to be reproducible?
Conclusions
• An observation is not the actual value
– Repeating the experiment might get different results
• The basic idea of inferential statistics is
– Predict range of future results if experiment was repeated
• ‘Significant’ = effect > 0 (e.g. 19 times out of 20)
• Based on the Binomial distribution
– Approximated by Normal distribution – many uses
• Plotting confidence intervals
• Use goodness of fit or single-sample z tests to compare an
observation with an expected baseline
• Use 2 × 2 chi-square tests or independent-sample z tests to compare two
observed samples
References
• Aarts, B., Close, J., and Wallis, S.A. 2013. Choices over time:
methodological issues in investigating current change. Chapter 2
in Aarts, B., Close, J., Leech, G., and Wallis, S.A. (eds.) The Verb
Phrase in English. Cambridge University Press.
• Wallis, S.A. 2013. Binomial confidence intervals and contingency
tests. Journal of Quantitative Linguistics 20:3, 178-208.
• Wilson, E.B. 1927. Probable inference, the law of succession, and
statistical inference. Journal of the American Statistical
Association 22: 209-212.
• NOTE: Statistics papers, more explanation, spreadsheets etc. are
published on corp.ling.stats blog: http://corplingstats.wordpress.com