Behavioral evaluation of corpus representativeness for Maltese

How specialized are specialized corpora?
Behavioral evaluation of corpus representativeness for Maltese
Jerid Francom (Wake Forest University)
Adam Ussishkin (University of Arizona)
Amy LaCross (University of Arizona)
19 May 2010: O7 (Evaluation of Methodologies), 14.45-15.05
LREC 2010, Mediterranean Conference Center
Valletta, Malta
Acknowledgements
Generous contribution of data to this project by
Dr. Albert Gatt (Univ. of Malta)
Statistical expertise from Jeff Berry (Univ. of
Arizona)
Funding from the United States National
Science Foundation (BCS-0715500) to Adam
Ussishkin
2
Goals
Issue
For many languages, the quality of available textual data is
less than ideal for corpus creation in the light of standard
sampling practices.
Propose
Behavioral data can provide a valuable metric to evaluate
corpus resources otherwise considered ‘specialized’.
Case
PsyCoL Maltese Lexical Corpus
Contribute
Novel, cross-discipline metric for evaluating the quality of
language resources
3
Sparse coverage
Most of the world’s roughly 7,000 languages have no
corpus resources
Efforts to fill the gap often exploit the
availability of language data on the web
An Crúbadán project, 446 languages (Scannell, 2007)
McEnery et al. (2006) survey of recent work
4
Sparse coverage
Low-density languages
(Borin, 2009)
Languages for which resources exist, but in
limited quantity/quality
Limited access to print and/or electronic data
Available primary data may be less than representative
Weakens assurance that results from low-density
language resources are credible
5
Corpus representativeness
What is a ‘representative
corpus’?
An externally valid sample of
language use
A sample that
approximates what the
language is.
Full range of structural
types (language units)
What are the characteristics of
such a sample?
Genre/register
Modality
6
An issue for low-density languages
Standard practice to achieve
representativeness
Apply rigorous sampling methods
Collect large amounts of data
Problematic for low-density languages: a
representativeness bottleneck
Lack large amounts of data
Available data is often limited in register,
modality, etc.
Corpus resources are typically specialized
7
Assessing representativeness
How do we know whether we have a
‘representative’ sample?
We don’t, in an absolute sense.
Faith in survey sampling practices
Casting the net far and wide
Can we be assured we don’t have a
representative sample?
Not exactly.
It is logically possible that smaller, less diverse
samples are externally valid for linguistic units
that appear in the collection.
8
Proposal
Need for an external metric.
Current proposal suggests findings from
behavioral experimentation can provide a
valuable metric to evaluate corpus resources.
Exploit the correlation between derived
frequency counts and elicited behavioral
reactions
Behavioral data and adjusted frequency
(Gries 2008; 2009)
Of particular importance for specialized corpora
9
Behavioral findings
Well-known robust effects for relative frequency
in language processing
Word naming RTs (e.g., Forster & Chambers, 1973)
Lexical decision RTs (e.g., Carroll & White, 1973)
Sentence reading RTs (e.g., MacDonald, 1994)
Word familiarity ratings (e.g., Gernsbacher 1984)
Log frequency is a good predictor of behavior.
10
Approach
Evaluating corpus representativeness through
behavioral assessment
1. Derive frequency counts from a specialized
corpus
2. Elicit behavioral response of participants from
target population
3. Assess correlation strength: how well do
behavioral responses correlate with corpus
measures?
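Step 3 can be sketched in a few lines. This is a minimal illustration, assuming a Pearson correlation over per-verb values; the counts and ratings below are invented toy data on a hypothetical 1-7 scale, not figures from the study.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical toy data: corpus token counts (step 1) and mean
# familiarity ratings (step 2) for the same verbs. Values are invented.
corpus_counts = {"kiser": 120, "kisser": 15, "tkisser": 4, "nkiser": 9}
mean_familiarity = {"kiser": 6.8, "kisser": 6.1, "tkisser": 5.2, "nkiser": 5.9}

verbs = sorted(corpus_counts)
log_freq = [math.log10(corpus_counts[v]) for v in verbs]
ratings = [mean_familiarity[v] for v in verbs]

# Step 3: how strongly do the two measures pattern together?
r = pearson_r(log_freq, ratings)
print(f"r = {r:.2f}")
```

A value near 1 would indicate that corpus frequency and speaker familiarity rank the verbs similarly; a value near 0 would point to a mismatch worth investigating.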
11
Case study and predictions
Case study
Calculate: log frequency of subset of items in a
Maltese lexical corpus
Measure: subjective word familiarity ratings of native
speakers of Maltese
Assess: relative distribution of the measures
Prediction
Congruence between relative distributions indicates a
representative sample of the language
Mismatches underscore potential sampling issues
12
The specialized corpus
PsyCoL Maltese Lexical Corpus (PMLC)
(Francom, Ussishkin, and Woudstra, 2009)
http://psycol.sbs.arizona.edu/resources/
Online Maltese newspapers, 1998-1999; 2005-2007
PsyCoL lab (59.8%) and Dr. Albert Gatt (40.2%)
3,323,325 total tokens (53,000 unique)
Type/token ratio of 1.6%
Typical for low-density languages
Large corpus, but still relatively small (cf. British
National Corpus, 100+ million; Corpus of
Contemporary American English, 400+ million)
Limited in register, modality
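The size figures on this slide can be checked with simple arithmetic. The sketch below assumes the 1.6% figure is unique types divided by total tokens, that 53,000 is a rounded type count, and that the comparison corpora are taken at the lower-bound sizes quoted.

```python
# Figures quoted on the slide; 53,000 appears to be a rounded type count.
total_tokens = 3_323_325
unique_types = 53_000

type_token_ratio = unique_types / total_tokens
print(f"type/token ratio: {type_token_ratio:.1%}")  # type/token ratio: 1.6%

# Lower-bound sizes quoted for the comparison corpora.
bnc_tokens = 100_000_000
coca_tokens = 400_000_000
print(f"BNC is about {bnc_tokens / total_tokens:.0f}x larger")    # about 30x
print(f"COCA is about {coca_tokens / total_tokens:.0f}x larger")  # about 120x
```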
13
Linguistic variable to quantify
Because there is little previous quantitative
research on Maltese, the empirical focus of this
investigation was narrowed to:
Semitic-origin verbs/binyanim (also known as
form)
Semitic-origin verbs in Maltese conform to the
classical Semitic binyan system (categories
based on morphosyntactic and phonological
properties)
Question: How does frequency as measured in
our corpus correlate with behavior?
Can the binyan categories be exploited to
provide correlations?
14
Maltese binyanim
Binyan  Function                                   Prosodic shape  Example
1       basic active (transitive or intransitive)  CVCVC           kiser ‘he broke’
2       intensive of 1, transitive of 1            CVCCVC          kisser ‘he smashed’
3       transitive of 1                            CV:CVC          bi:rek ‘he blessed’
5       passive of 2, reflexive of 2               tCVCCVC         tkisser ‘it got smashed’
6       passive of 2, reflexive of 3               tCV:CVC         tki:teb ‘he corresponded’
7       passive of 1, reflexive of 1               nCVCVC          nkiser ‘it got broken’
8       passive of 1, reflexive of 1               CtVCVC          ftakar ‘he remembered’
9       inchoative, acquisition of a quality       CCV:C           hma:r ‘he blushed’
10      originally inchoative                      stVCCVC         stenbah ‘to wake’
15
A behavioral task: word familiarity
We devised three tests to measure corpus
representativeness
Each test measured a different aspect of our
corpus counts and our behavioral task.
The behavioral task involved native Maltese
speakers, who gave subjective word familiarity
ratings for all Semitic-origin Maltese verbs taken
from Aquilina (2000); n=1536.
Scale from very unfamiliar to very familiar
Shown to be a reliable predictor of lexical processing
(Connine et al. 1990)
16
Word familiarity experiment
Participants
107 native speakers of Maltese
Task
Subjective word familiarity task, online
17
Measuring frequency in the corpus
We then used the PMLC to calculate word
frequency measures for the same set of verbs.
Using regular expression-enabled searching,
we counted token frequency for all verbs
occurring in the PMLC (n=447).
Frequency was then encoded as a log-based
measure.
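The counting step might look roughly like the sketch below. It is an illustration only: the slide does not give the search patterns, the log base, or any smoothing, so the whole-word pattern, `log10`, and the add-one adjustment are all assumptions, and the sample string is a made-up sequence of verb forms rather than real corpus text.

```python
import math
import re
from collections import Counter

def count_tokens(text, targets):
    """Count whole-word occurrences of each target form (case-insensitive)."""
    counts = Counter()
    for form in targets:
        # Lookarounds keep 'kiser' from matching inside 'nkiser'.
        pattern = re.compile(r"(?<!\w)" + re.escape(form) + r"(?!\w)", re.IGNORECASE)
        counts[form] = len(pattern.findall(text))
    return counts

def log_frequency(count):
    """Assumed encoding: log10(count + 1), so an unattested form maps to 0."""
    return math.log10(count + 1)

# Made-up token string for illustration; not running Maltese text.
sample = "kiser kisser kiser tkisser kiser nkiser"
counts = count_tokens(sample, ["kiser", "kisser", "tkisser", "nkiser"])

for form, n in counts.items():
    print(form, n, round(log_frequency(n), 3))
```

A search over inflected forms of each verb would need broader patterns than the exact-form match used here.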
18
Three tests
Next, we conducted three distinct statistical
analyses to assess the correlation between these
corpus measures and the results of our word
familiarity experiment:
1. Statistical regression between corpus log
frequency and behavioral data.
2. Binning items by frequency to determine
whether any correlation is found.
3. Binning items by binyan to determine whether
any correlation is found.
19
1. Statistical regression
We found a weak correlation (r=.14). At best
this shows a trend toward correlation; it
suggests that familiarity ratings likely do not
predict word frequency.
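One way to read the size of this correlation, assuming r is a Pearson coefficient (the slide does not say which coefficient was used): squaring it gives the share of variance in one measure that is linearly accounted for by the other.

```latex
r = 0.14 \quad\Longrightarrow\quad r^{2} = 0.14^{2} = 0.0196 \approx 2\%
```

Under that assumption, about 2% of the variance is shared, which is consistent with describing the result as a trend at best.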
20
2. Binning by frequency
Binning into two bands shows a correlation.
Binning into three bands also shows a
correlation.
21
2. Binning by frequency
An LMER (linear mixed-effects regression)
analysis of each binning (2 groups and 3 groups)
shows significance:
All contrasts for two-bin intervals (High/Low=4.2,
t=2.0) and three-bin intervals (High/Mid=7.1, t=3.9;
Mid/Low=7.0, t=2.2) were significant.
These results support the hypothesis that
behavior and corpus measures are correlated.
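The binning step can be sketched as follows. The slides do not say how band boundaries were chosen, so equal-size bands ordered by log frequency are an assumption here, and the items are invented. The sketch only compares band means; it does not reproduce the mixed-effects model.

```python
from statistics import mean

def bin_by_frequency(items, n_bins):
    """Split (verb, log_freq, rating) tuples into n_bins equal-size bands,
    ordered from lowest to highest frequency."""
    if not 1 <= n_bins <= len(items):
        raise ValueError("n_bins must be between 1 and the number of items")
    ordered = sorted(items, key=lambda item: item[1])
    size, extra = divmod(len(ordered), n_bins)
    bands, start = [], 0
    for i in range(n_bins):
        # The first `extra` bands absorb one leftover item each.
        end = start + size + (1 if i < extra else 0)
        bands.append(ordered[start:end])
        start = end
    return bands

def band_means(bands):
    """Mean familiarity rating within each band."""
    return [mean(rating for _, _, rating in band) for band in bands]

# Invented items: (verb label, log frequency, mean familiarity rating).
items = [("v1", 0.3, 5.1), ("v2", 0.9, 5.4), ("v3", 1.2, 5.9),
         ("v4", 1.8, 6.0), ("v5", 2.4, 6.5), ("v6", 3.0, 6.7)]

two_band = band_means(bin_by_frequency(items, 2))    # [Low, High]
three_band = band_means(bin_by_frequency(items, 3))  # [Low, Mid, High]
print(two_band, three_band)
```

If familiarity tracks frequency, the band means should rise from Low to High, as they do in this toy data.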
22
3. Binning by binyan
Earlier and ongoing work (Frost et al. 1997,
1998, 2000; Ussishkin et al. in progress) shows
binyan effects in Hebrew in both visual and
auditory modalities, so Maltese could be
expected to show similar effects.
Our goal here is to measure whether verbs,
when grouped by binyan, show a correlation
between word frequency measures and word
familiarity ratings.
23
3. Binning by binyan
Only binyanim 1, 2, 5, and 7 were analyzed;
binyanim 3, 6, 8, 9, and 10 were not included in
the analyses because they are so sparsely
populated.
24
3. Binning by binyan
Word frequency results: significant contrasts
found between Binyanim 7 and 2 (β=.54, t=6.0)
and between Binyanim 7 and 5 (β=1.15, t=-2.2).
Word familiarity results: no significant contrasts
found.
Figure panels: Binyan by word frequency; Binyan by word familiarity
25
General assessment
The results show that verb frequency
distributions in the PMLC pattern to some
degree with the psychological representations
of native speakers (the representative
population).
On the surface, this suggests the PMLC is on the
right track, but it also underscores the specialized
nature of the corpus.
However, a response bias in the
word familiarity task may play
a part in the mismatches.
A ceiling effect may have contributed
to lower correlation scores.
26
General assessment
Reasons to be optimistic about the
verb distributions in the PMLC:
Distribution of verb count/frequency (Zipf, 1949)
Distribution of word length/frequency (Li, 1992)
Both measures trend as expected for
representative samples
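A quick way to check the first of these distributions is to fit a line to log frequency against log rank; under Zipf's law the slope is close to -1. The sketch below is an assumption about how such a check could be run, not the analysis behind the slide, and it uses an idealized frequency list rather than PMLC counts.

```python
import math

def loglog_slope(frequencies):
    """Least-squares slope of log10(frequency) against log10(rank)."""
    ranked = sorted((f for f in frequencies if f > 0), reverse=True)
    if len(ranked) < 2:
        raise ValueError("need at least two positive frequencies")
    xs = [math.log10(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log10(f) for f in ranked]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = sum((x - mean_x) ** 2 for x in xs)
    return numerator / denominator

# Idealized Zipfian list: frequency is inversely proportional to rank.
ideal = [1000 / rank for rank in range(1, 51)]
print(round(loglog_slope(ideal), 2))  # -1.0
```

Real counts will deviate from the ideal line, especially at the lowest ranks and in the long tail, so the slope is a rough diagnostic rather than a test.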
27
Conclusion
Novel methodology: direct comparison between
a corpus resource and behavior.
Highlights a robust effect from
psycholinguistics (frequency of linguistic units
predicts behavior).
We predicted that the reverse could also hold, with
behavior used to assess corpus frequency counts; this
provides a way to validate low-density language (LDL)
resources.
This approach encourages cross-discipline
endeavors for resource development and
theoretical investigation.
28
Thank you very much!
Grazzi ħafna!
29