Collecting and interpreting
acceptability judgments using
Magnitude Estimation
Caroline Heycock
with
Zakaris Svabo Hansen and Antonella Sorace
University of Edinburgh
NLVN-course/NORMS-seminar
Tórshavn, Faroe Islands, 8–16 August 2008
Outline
• Why do we need acceptability judgments?
• What are the problems with acceptability
judgments?
• How can Magnitude Estimation help with any of
these problems?
• Exemplification from ongoing studies on Faroese
(and related languages)
Why do we need judgment data?
• There is no direct way to access I-language (the
speaker’s knowledge of their language), so we need
to triangulate from all available sources of data.
• Corpus data typically
– aggregate across speakers
– include performance errors
– allow no straightforward distinction between non-occurring and ungrammatical
– may not exist
Validity
Judgments are also a type of behaviour, known to
be affected by
– processing constraints
– personality and mental state
– presentation (order, context, mode)
– absolute vs relative task
– linguistic training
Reliability
• Interspeaker variation
– This may or may not be considered a problem of
reliability, depending on assumptions about individuals’
grammars, but it is at least a methodological problem
• Intraspeaker inconsistency
Conventional measurements of
acceptability
• Judgments of linguistic acceptability usually form
category scales (ok/*) or limited ordinal scales
(ok/?/?*/*), (1,2,3,4,5)
• These scales require absolute rating judgments,
rather than relative ranking judgments
• Ordinal scales provide no information about the
relative distance between adjacent points on the
scale
Problems arising with conventional
scales for acceptability judgments
• Limited in their range of values
• Lack of statistical power
– These scales cannot be analysed using parametric statistics,
because this type of analysis requires the data to be on at least an
interval scale.
• Inconsistency
– Even trained linguists use diacritics in different ways. Comparison
between different studies is extremely difficult.
• Uninterpretability
– What do the middle points on a rating scale actually mean?
– How can we distinguish between lack of certainty and intermediate
acceptability?
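Judgment data: interpreting midpoints
Thráinsson 2003, Petersen 2000
Percentage of √ / ? / * judgments for the V-Adv and Adv-V orders.
Clause types listed on the slide: +bridge compl, -bridge compl,
Relative, Indirect question, Adverbial clause. The nine data rows are
given in their extracted order; their alignment with the clause types
and the two sources is not recoverable from the transcript.

        V-Adv                  Adv-V
   √      ?      *        √      ?      *
  34%    33%    33%      75%    21%     4%
  66%     7%    26%      92%     0%     8%
  14%    41%    45%      82%    14%     4%
  25%     6%    69%      98%     0%     2%
   5%    31%    64%      81%    17%     2%
   3%     0%    97%     100%     0%     0%
   5%    32%    63%      74%    21%     5%
   0%     0%   100%     100%     0%     0%
  39%    37%    24%      81%    17%     2%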
M[agnitude] E[stimation] in
psychophysics
• ME is an experimental technique used to determine
quickly and easily how much of a given sensation a person
is having.
• In an ME experiment subjects are presented with a
standard stimulus (a modulus) and are asked to express its
magnitude as a number.
• They are then presented with a series of stimuli that vary in
intensity and are asked to assign each of the stimuli a
number relative to the modulus.
ME in psychophysics
• Subjects assign a number:
– to the modulus to reflect magnitude of pertinent
characteristics (length, loudness, brightness)
– to each successive stimulus to indicate apparent
magnitude relative to the first (or to a previous
stimulus)
ME in psychophysics: Scaling
• Scaling in ME is not about absolute accuracy of
judgments;
• Scaling is about the relative relationships between
judgments of stimuli of different intensities.
ME in psychophysics: modalities
• The numerical modality is the most common but
other modalities are possible (e.g. line length).
• Other modalities can be more user-friendly,
particularly if you are testing people who (think
they) are numerically-challenged.
ME in psychophysics:
can people do it?
• Many magnitude estimation experiments use a
control condition in which subjects are asked to
perform magnitude estimations of the length of a
line.
• Magnitude estimations of line length have been
shown to be proportional to the actual length of
the lines.
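The proportionality of line-length estimates can be checked on control data by fitting a power function to the estimates. The sketch below is only an illustration: the helper name `fit_power_exponent` and the numbers are invented for this example and are not data from any study. It fits a straight line in log-log space; an exponent close to 1 would indicate that the estimates are proportional to the physical line lengths.

```python
import math


def fit_power_exponent(lengths, estimates):
    """Least-squares slope of log(estimate) regressed on log(length).

    Hypothetical helper for illustration; assumes all values are positive.
    """
    xs = [math.log(v) for v in lengths]
    ys = [math.log(v) for v in estimates]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Invented control data: line lengths (mm) and one subject's estimates.
line_lengths = [10, 20, 40, 80, 160]
subject_estimates = [5, 10, 20, 40, 80]  # proportional by construction

exponent = fit_power_exponent(line_lengths, subject_estimates)
print(round(exponent, 3))
```

Working in logs means the subject's choice of number range (here, half the length in mm) only shifts the intercept and leaves the exponent untouched.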
ME in Linguistics
• Unlike other dimensions, linguistic acceptability
has no obvious “physical” continuum to plot
against subjects’ impressions.
• However, Bard, Robertson & Sorace 1996 have
applied standard cross-modality matching
techniques and were able to show that the
technique is reliable.
Typical instructions
• Here’s an example of what the instructions look
like...
Instructions
The purpose of this exercise is to get you to judge
the acceptability of some English sentences. You
will see a series of sentences on the screen. These
sentences are all different. Some will seem
perfectly okay to you, but others will not. What
we're after is not what you think of the meaning of
the sentence, but what you think of the way it's
constructed.
• Your task is to judge how good or bad each
sentence is by assigning a number to it.
• You can use any number that seems appropriate
to you. For each sentence after the first, assign a
number to show how good or bad that sentence is
in proportion to the reference sentence.
For example, if the first sentence was:
(1) cat the mat on sat the.
and you gave it a 1, and if the next example:
(2) the dog the bone ate.
seemed 20 times better, you'd give it twenty. If
it seems half as good as the reference sentence,
give it the number 0.5
• You can use any range of positive numbers you
like including, if necessary, fractions or decimals.
• You should not restrict your responses to, say, an
academic marking scale.
• You may not use minus numbers or zero, of course,
because they aren't proper multiples or fractions
of positive numbers.
• If you forget the reference sentence don't worry; if
each of your judgments is in proportion to the
first, you can judge the new sentence relative to
any of them that you do remember.
• There are no 'correct' answers, so whatever seems
right to you is a valid response. Nor is there a
'correct' range of answers or a 'correct' place to
start.
• Any convenient positive number will do for the
reference.
• We are interested in your first impressions, so
don't spend too long thinking about your
judgment.
Remember:
• Use any number you like for the first sentence.
• Judge each sentence in proportion to the reference
sentence.
• Use any positive numbers you think appropriate.
Choices about the modulus:
face validity
• The experimenter has the option of assigning a fixed
number to the modulus.
• Another option is to leave the modulus in sight throughout
the experiment.
• This option has good face validity, but it isn’t clear to what
extent it affects the ultimate reliability of the estimates.
• People don’t need to remember the modulus; if they are
making judgments proportionally, the reference point shifts
as they move on.
Advantages of quasi-randomization
• The experimenter can impose constraints on the
randomization to prevent certain experimental
items from occurring consecutively.
• The modulus can be chosen to represent an
intermediate degree of acceptability.
• A number (or a line) of intermediate size can be
assigned to the modulus by the experimenter.
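One simple way to impose a constraint on the randomization is to reshuffle until no two neighbouring items come from the same condition. The sketch below is an assumption about how this could be done, not the procedure used in the studies reported here; the function `quasi_randomize`, the item labels and the condition names are all hypothetical.

```python
import random


def quasi_randomize(items, key, max_tries=10_000, seed=None):
    """Shuffle items so that no two consecutive items share key(item).

    Hypothetical helper: plain rejection sampling, which is adequate for
    small item lists but can fail if the constraint is too strict.
    """
    rng = random.Random(seed)
    order = list(items)
    for _ in range(max_tries):
        rng.shuffle(order)
        if all(key(a) != key(b) for a, b in zip(order, order[1:])):
            return order
    raise RuntimeError("no valid order found; relax the constraint")


# Invented item list: (sentence id, condition), four items per condition.
items = [(f"s{i:02d}", cond)
         for i, cond in enumerate(["V-Adv", "Adv-V", "filler"] * 4)]

order = quasi_randomize(items, key=lambda item: item[1], seed=1)
for sentence_id, condition in order:
    print(sentence_id, condition)
```

Passing a seed makes a presentation order reproducible; leaving it out gives each subject a fresh order.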
Timed vs untimed ME
• Timing the intervals between sentences may
reduce the likelihood that people consult
metalinguistic or prescriptive knowledge.
• Intervals have to be different for non-native
speakers: they have to be piloted carefully.
Varying the instructions
• There is a tendency in some people to use a fixed
(usually 10-point) scale. This is possibly because
of familiarity with school marking systems.
• If the instructions contain an explicit warning
against using a restricted range of numbers, the
tendency is much reduced.
• People are very sensitive to instructions: these
have to be as explicit and clear as possible.
• A detailed practice session is essential!
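A rough screening check for the fixed-scale tendency can be run on practice-session data. The sketch below is a heuristic only, and an assumption rather than part of the original method: the function name `looks_like_fixed_scale` and the 1 to 10 bounds are hypothetical, and a subject may of course use such values legitimately.

```python
def looks_like_fixed_scale(estimates, low=1, high=10):
    """Return True if every estimate is a whole number within [low, high].

    Hypothetical heuristic: flags response sets that resemble a 10-point
    marking scale; it does not prove the subject misunderstood the task.
    """
    return all(float(e).is_integer() and low <= e <= high for e in estimates)


# Invented practice-session responses for two hypothetical subjects.
subject_a = [3, 7, 10, 1, 5]       # resembles a marking scale
subject_b = [0.5, 20, 3, 45, 12]   # uses an open-ended range

print(looks_like_fixed_scale(subject_a))
print(looks_like_fixed_scale(subject_b))
```

Flagged subjects could be given the warning about restricted ranges again before the main session.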
Advantages
• ME yields interval scales, which allow the use of
parametric statistics
• Mathematical operations can be applied to the
estimates, allowing:
– a direct indication of the speaker’s ability to
discriminate between more or less acceptable sentences
– a direct measure of the strength of speakers’ preferences
Advantages
• Informants are enabled to express their intuitions without
any restrictions of the judgment scale.
• They are asked to provide purely comparative judgments:
these are relative both to a reference item and the
individual subject’s own previous judgments.
• At no point is an absolute criterion of grammaticality
applied.
• The subjects themselves fix the value of the reference item
relative to which subsequent judgments are made.
Advantages
• The scale used by informants is open-ended and has no
minimum division: subjects can always add a further
highest score or produce an additional intermediate rating.
• The result is that subjects are able to produce judgments
which distinguish all and only the differences they
perceive.
Data analysis: normalisation
ME data need to be normalised because people
use different ranges of estimates.
• Raw magnitude values are often transformed into
logs in order to yield a normal distribution.
• Each number is divided by the modulus that the
subject had assigned to the reference sentence, or
alternatively the z-scores are used.
• Any statistical package can easily do these
transformations.
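The transformations listed above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the analysis script used for the studies reported here: the function names, the base-10 logarithm, the use of the sample standard deviation and the raw numbers are all choices made for this example.

```python
import math
from statistics import mean, stdev


def normalise_by_modulus(estimates, modulus):
    """Divide each raw estimate by the subject's modulus, then take log10."""
    return [math.log10(e / modulus) for e in estimates]


def z_scores(values):
    """Standardise one subject's values (sample standard deviation, n - 1)."""
    m = mean(values)
    s = stdev(values)
    return [(v - m) / s for v in values]


# Invented raw estimates: two hypothetical subjects who make the same
# proportional judgments but use very different number ranges.
raw = {
    "subject_A": {"modulus": 10, "estimates": [10, 20, 5, 40]},
    "subject_B": {"modulus": 100, "estimates": [100, 200, 50, 400]},
}

normalised = {
    name: normalise_by_modulus(d["estimates"], d["modulus"])
    for name, d in raw.items()
}
standardised = {
    name: z_scores([math.log10(e) for e in d["estimates"]])
    for name, d in raw.items()
}

for name in raw:
    print(name, [round(v, 3) for v in normalised[name]])
    print(name, [round(v, 3) for v in standardised[name]])
```

Either route removes the difference in range: after normalisation the two subjects have the same values, which is what makes their judgments comparable.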
Faroese
Some questions:
1. Do current speakers of Faroese have V-to-I as part of
their competence grammar(s)?
That is, do they allow the order Finite Verb > Negation in
all types of subordinate clause?
2. Do current speakers of Faroese allow “generalised
embedded Verb Second” (V2)?
That is, do they allow a wide range of subordinate
clauses to begin with something other than the subject?
3. With respect to these phenomena, how is Faroese situated
with respect to Icelandic and Danish?
How acceptable is V-to-I in Faroese?
We looked at the effect of two variables and their
interaction (2 within-subjects variables, 2 and 3 levels):
• Order
– Verb-Adverb
– Adverb-Verb
• Type of “adverb”
– Negation (ikki)
– “High” adverb (kanska)
– “Low” adverb (ofta)
These orders were all contained in relative clauses.
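Crossing the two variables above gives six conditions, each of which needs its own set of test sentences. The short sketch below only enumerates the cells of the design; the variable names are chosen for this illustration.

```python
from itertools import product

# The two within-subjects variables described above.
orders = ["Verb-Adverb", "Adverb-Verb"]
adverb_types = ["Negation (ikki)", "High adverb (kanska)", "Low adverb (ofta)"]

# Full crossing: 2 orders x 3 adverb types.
conditions = list(product(orders, adverb_types))

for order, adverb in conditions:
    print(f"{order:12s} | {adverb}")
```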
Examples
• Adverb: Negation
Order: V-Adv
Hatta er filmurin, sum Hanus hevur ikki sæð
That is film-def that Hanus has neg seen
• Adverb: Negation
Order: Adv-V
Hetta er brævið, sum Elin ikki hevur lisið
That is letter-def that Elin neg has read
• Adverb: Low Adv
Order: V-Adv
Hetta er lagið, sum Teitur hevur ofta spælt
That is piece-def that Teitur has often played
• Adverb: Low Adv
Order: Adv-V
Hatta er sangurin, sum Eivør ofta hevur sungið
That is song-def that Eivør often has sung
How “generalized” is V2 in Faroese?
We looked at the effect of two variables and their interaction (2 within-subjects variables, 2 and 5 levels):
• Order
– Subject-Initial
– Adjunct-Initial
• Clause type
– Main clause
– “Bridge verb” complement
– Nonbridge verb A complement (regret, admit)
– Nonbridge verb B complement (deny, doubt, be proud)
– Indirect question
Examples
• Clause Type: Bridge
Order: Subject-Initial
Lív segði, at hon kom seint til arbeiðis í gjár
Lív said that she came late to work yesterday
• Clause Type: Bridge
Order: Adjunct-Initial
Beinir segði, at í morgin kemur hann seint til arbeiðis
Beinir said that tomorrow comes he late to work
• Clause Type: NonBridge B
Order: Subject-Initial
Sámal noktaði, at hann hevði verið alla náttina á barrini í fleiri førum
Sámal denied that he had been all night in bar-def frequently
• Clause Type: NonBridge B
Order: Adjunct-Initial
Einar noktaði, at í fleiri førum hevði hann drukkið alla náttina á barrini
Einar denied that frequently had he drunk all night in bar-def
Faroese 1 vs Faroese 2: geographic?
• In Jonas 1996 it is argued that there are two distinct “dialects” in
Faroese:
– Faroese 1, which optionally allows V-to-I
– Faroese 2, which does not allow V-to-I
• Jonas suggests that these two dialects may correlate both with age and
with dialect area: Faroese 1 more common in the southern islands, and
among older speakers.
• We investigated the geographic dialect suggestion by collecting data
from 25 subjects from Tórshavn (North) and 22 subjects from Suðuroy
(South). Subjects were, as much as possible, matched for age.
Verb position: North vs South
[Figure: mean z-scores by position of verb (Verb-Adverb, Adverb-Verb), plotted separately for North and South; y-axis from -0.3 to 0.4]
No geographic dialect difference
• The main effect of dialect group was not
significant
• There was no significant interaction between
dialect group and position of verb, or between
dialect group and type of adverb
• We did not find any evidence for a geographic
dialect difference with respect to V-to-I in our
subjects
Comparison with Danish, Icelandic
• There is a significant interaction between language
and order of the verb with respect to
Negation/Adverb.
• I.e. the effect of the different orders is different,
depending on the language...
Position of the verb
[Figure: estimated marginal means by position (Verb-Adverb, Adverb-Verb) for Icelandic, Danish and Faroese; y-axis from -0.5 to 0.6]
Comparing Verb/Adverb orders
• To see whether there is any difference between the different
adverbs in terms of whether or not the verb can move past
them, we can look at the difference between the Verb-Adverb and Adverb-Verb orders with respect to each of
the three adverbs
• We’d expect no difference between verb movement over
the three adverbs in Icelandic (all should be good) and in
Danish (all should be bad)
• If Faroese is just intermediate between Icelandic and
Danish, we’d also expect no effect of the different adverb
types here.
The effect of verb movement past different adverbs
[Figure: z-scores by type of adverb (Negation, High Adverb, Low Adverb) for Icelandic, Danish and Faroese; y-axis from -1.0 to 1.0]
Comparing Verb/Adverb orders
• Our Faroese subjects dispreferred the order Finite Verb - Negation in
an unambiguously non-V2 context to the same extent that the Danish
subjects did.
• However, our Faroese subjects found Verb-Adverb orders better than
Verb-Negation orders (this effect was found neither in Danish nor in
Icelandic).
• It is possible that to the extent that IP-internal verb movement is still
grammatical in Faroese, for some speakers it is to an intermediate
position.
Looking at the effect of V2
The best measure of the effect of V2 is to look at the
difference between the Subject-Initial and Adjunct-Initial
order, for each clause type:
That is, what is the difference between the scores for
sentences of type (a) and type (b) for each clause type?
(a) Order: Subject-Initial
Lív segði, at hon kom seint til arbeiðis í gjár
Lív said that she came late to work yesterday
(b) Order: Adjunct-Initial
Beinir segði, at í morgin kemur hann seint til arbeiðis
Beinir said that tomorrow comes he late to work
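The difference measure described above can be computed as a per-clause-type difference of mean z-scores. The sketch below is an illustration under assumptions: the scores are invented, the function name `v2_effect` is hypothetical, and the direction of subtraction (Adjunct-Initial minus Subject-Initial, so that a negative value means the V2 order is rated worse) is a choice made for this example.

```python
from statistics import mean


def v2_effect(rows):
    """Mean Adjunct-Initial z-score minus mean Subject-Initial z-score,
    computed separately for each clause type.

    rows: iterable of (clause_type, order, z_score) tuples.
    """
    rows = list(rows)
    by_cell = {}
    for clause, order, z in rows:
        by_cell.setdefault((clause, order), []).append(z)
    clause_types = dict.fromkeys(clause for clause, _, _ in rows)
    return {
        clause: mean(by_cell[(clause, "Adjunct-Initial")])
        - mean(by_cell[(clause, "Subject-Initial")])
        for clause in clause_types
    }


# Invented z-scores, two observations per cell, for illustration only.
scores = [
    ("Main", "Subject-Initial", 0.4), ("Main", "Subject-Initial", 0.6),
    ("Main", "Adjunct-Initial", 0.3), ("Main", "Adjunct-Initial", 0.5),
    ("Indirect question", "Subject-Initial", 0.5),
    ("Indirect question", "Subject-Initial", 0.3),
    ("Indirect question", "Adjunct-Initial", -0.9),
    ("Indirect question", "Adjunct-Initial", -0.7),
]

effects = v2_effect(scores)
for clause, diff in effects.items():
    print(f"{clause}: {diff:+.2f}")
```

Using the difference rather than the raw score for the Adjunct-Initial order factors out anything that makes a clause type harder or easier to judge regardless of word order.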
The effect of V2 in different clause types
[Figure: difference in z-scores by clause type (Main, Bridge, NonBridge A, NonBridge B, Ind qu) for Icelandic, Danish and Faroese; y-axis from -1.6 to 0.2]
The effect of V2: Danish
• In Danish there was a significant difference between the effect of V2 in
a main clause and after the second category of “nonbridge” verbs
(deny, doubt, be proud).
• There was however no significant difference between the effect of V2
in a main clause and after the first category of “nonbridge” verbs
(regret, admit).
• Taken together, this suggests that for this language Vikner’s original
categorisation of “bridge” verbs for V2 is not correct; instead these
results are more consistent with the proposals in Bentzen et al (2007)
or Julien (2007).
The effect of V2: Faroese and Icelandic
• In Faroese and Icelandic, however, there is no significant
difference between the effect of V2 in a main clause and
after the second category of “nonbridge” verbs.
• This suggests that V2 in these languages targets a different
projection than in Danish (and the other mainland
Scandinavian languages?)
Is apparent V-to-I really V2?
V2:
• Clause Type: Nonasserted
Order: Subject-Initial
Ronaldo noktar, at hann hevur skrivað undir sáttmála við Liverpool næsta ár
Ronaldo denies that he has signed contract with Liverpool next year
• Clause Type: Nonasserted
Order: Adjunct-Initial
Næmingarnir noktaðu, at í fríkorterinum høvdu teir roykt á vesinum
Students-def denied that in breaks had they smoked in toilets-def
“V-to-I”
• Clause Type: Nonasserted
Order: Negation-Verb
Handilskvinnan noktaði, at hon ikki hevði læst handilin í gjárkvøldið
Shopkeeper denied that she not had locked shop-def yesterday evening
• Clause Type: Nonasserted
Order: Verb-Negation
Sámal noktaði, at hann hevði ikki latið sjálvuppgávuna inn til tíðina
Sámal denied that he had not handed assignment in on time
Effect of “verb movement”
[Figure: z-scores (log) by clause type ("said...", "denied...", Ind Qu) for V2 and V-to-I; y-axis from -1.4 to 0.2]
Conclusion
• Judgment data are important for linguistic analysis,
especially where corpora are not available, but even where
they are.
• In investigating language we are always dealing with
behaviour, when we want to learn about knowledge.
Investigating different types of behaviour may help us to
narrow down the range of possibilities
• Magnitude Estimation is a method for gathering judgment
data that allows for a wider range of analytical tools than
many other techniques
All data collected by Zakaris Svabo Hansen for the project
Verb movement in contemporary Faroese
http://www.ling.ed.ac.uk/~heycock/faroese-project.shtml
Project funded by the Arts and Humanities Research Council
Some References
• Bard, E.G., Robertson, D. and Sorace, A. 1996. Magnitude estimation of
linguistic acceptability. Language 72: 32–68.
• Featherston, S. 2005. Magnitude estimation and what it can do for your
syntax: Some wh-constraints in German. Lingua 115: 1525–1550.
• Featherston, S. 2007. Data in generative grammar: the stick and the carrot.
Theoretical Linguistics 33(3): 269–318.
• Keller, F. 2003. A psychophysical law for linguistic judgments. Proceedings of
the 25th Annual Conference of the Cognitive Science Society. Mahwah:
Lawrence Erlbaum.
• Sorace, A. 1996. The use of acceptability judgments in second language
research. In V. T. Bhatia and W. Ritchie (eds.) Handbook of Second Language
Acquisition. New York: Academic Press, p. 375–409.
• Sorace, A. & Keller, F. in press. Gradience in linguistic data. To appear in
Lingua.
• Sprouse, J. 2007. A program for experimental syntax: Finding the relationship
between acceptability and grammatical knowledge. PhD thesis, University of
Maryland, College Park.