Linguistic Profiling

Laura A. Janda
CLEAR (Cognitive Linguistics: Empirical
Approaches to Russian)
UiT The Arctic University of Norway
A Big Question perspective
Big Questions
• Transcend theory
• Interesting for all linguists
Theory
• Helps to focus Big Questions
Operationalization
• Facilitates quantitative methods
Overview
1. Big Questions
2. Theoretical perspective
3. Operationalization
4. Portable
5. Multipurpose
6. Examples
7. Infrastructure
8. Applications
1. Some Big Questions
What is the relationship between form and meaning?
What is the relationship between lexicon and grammar?
What is the structure of linguistic categories?
What is the structure of linguistic constructions?
2. Theoretical perspective:
Cognitive linguistics
Minimal Assumption: language can be accounted for in terms of
general cognitive strategies
• no autonomous language faculty
• no strict division between grammar and lexicon
• no a priori universals
Usage-Based: generalizations emerge from language data
• no strict division between langue and parole
• no underlying forms
Meaning is Central: holds for all language phenomena
• no semantically empty forms
• differences in behavior are motivated (but not specifically
predicted) by differences in meaning
Big Questions focused by
Cognitive Linguistics
What is the relationship between form and meaning?
How does form reflect meaning?
Can we use difference in form as a measure of meaning?
What is the relationship between lexicon and grammar?
How do we account for meaning in grammar?
Can we use similar models for grammatical meanings?
What is the structure of linguistic categories?
What is the relationship between prototype and periphery?
Can we compare category structure across near synonyms?
What is the structure of linguistic constructions?
Are constructions hierarchical or flat?
What is the relationship between constructions and fillers?
3. Operationalization:
Linguistic profiles
Focused subsets of behavioral profiles (Firth 1957, Harris
1970, Hanks 1996, Geeraerts et al. 1999, Speelman et al.
2003, Divjak & Gries 2006, Gries & Divjak 2009)
Grammatical profiling: relationship between frequency
distribution of forms and linguistic categories
Semantic profiling: relationship between meanings
(semantic tags) and forms
Constructional profiling: relationship between frequency
distribution of grammatical constructions and meaning
Radial category profiling: differences in the frequency
distribution of uses across two or more near-synonyms
Collostructional profiling: relationship between a
construction and the words that fill its slots
4. Portable
Linguistic profiles are portable
– across questions
– across theories
– across statistical models
– across languages
Linguistic profiles are a suite of methodological ideas that
make it possible to approach Big Questions empirically from
a variety of angles
Ideally results are also portable across platforms
– open source, open access, available to all researchers
5. Multipurpose
Quantitatively measured results yield real gains in
our understanding of languages
These results can serve multiple purposes:
– resources for language learners and users
– (real, not statistical) machine translation
– documentation and revitalization for minority
indigenous languages
– language policy
6. Examples
• Grammatical Profiles: TAM in Russian
• Semantic Profiles: “Empty” prefixes in Russian
• Constructional Profiles: SADNESS in Russian
• Radial Category Profiles: Ambipositions in North Saami
For each example we will identify:
• Big Questions
• Theoretical perspective
• Operationalization (Profiling) & statistical methods
• Portability
• Multipurpose applications
Grammatical Profiles: TAM in Russian
Janda, L. A. & Lyashevskaya, O. 2011.
“Grammatical profiles and the interaction of the
lexicon with aspect, tense and mood in Russian”.
Cognitive Linguistics 22:4 (2011), 719-763.
Crash course in Russian TAM
Tense: Past vs. Non-Past
– Non-Past: Imperfective = Present vs. Perfective = Future
Aspect: Perfective (marked) vs. Imperfective (unmarked)
– All forms of all verbs express aspect
– “Aspectual pairs” = same lexical meaning, different aspect, e.g.,
pisat’ ‘write[imperfective]’ vs. napisat’ ‘write[perfective]’
– Aspectual pairs can be formed via both prefixation and suffixation
(perepisat’ ‘rewrite[perfective]’ vs. perepisyvat’ ‘rewrite[imperfective]’)
– ≈1400 imperfective base stems form ≈2000 perfective aspectual
partners using 16 prefixes
– ≈20K perfective stems form imperfective partners using 3 suffixes
– These affixes are traditionally assumed to be “empty”
Mood: imperative, infinitives in modal constructions
Grammatical Profiles: TAM in Russian
Big Questions:
What is the relationship between form and meaning?
➜ between verb inflection and grammatical meaning of aspect?
What is the relationship between lexicon and grammar?
➜ between lexical meaning of verbs and TAM?
Grammatical Profiles: TAM in Russian
Theoretical focus:
Can we measure the expression of aspect
according to distribution of inflected forms?
Can we distinguish between prefixation vs.
suffixation in formation of aspectual pairs?
Can we measure the attraction of lexical classes to
grammatical categories?
Grammatical Profiles: TAM in Russian
Operationalization:
Grammatical profiles: frequency distribution of inflected forms
➜Distribution of Russian verb forms according to
subparadigm
➜Distribution of Russian verbs according to subparadigm
Data:
Approx. 6M verb forms from the Russian National Corpus
(http://ruscorpora.ru/ )
Statistics:
Chi-square, Cramer’s V effect size, distribution plots
What is a grammatical profile?
Verbs have different forms:
Form      Frequency
eat       749 M
eats      121 M
eating    514 M
eaten     88.8 M
ate       258 M
[Figure: The grammatical profile of eat. Bar chart of each form's share of the total (y-axis 0% to 50%).]
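A grammatical profile is simply each form's share of a verb's total occurrences. The short sketch below computes it from the frequencies listed above for eat; the function name grammatical_profile is hypothetical and chosen only for illustration.

```python
# Frequencies (in millions) for the forms of "eat", as listed on the slide.
counts = {"eat": 749, "eats": 121, "eating": 514, "eaten": 88.8, "ate": 258}


def grammatical_profile(form_counts):
    """Return each form's relative frequency (its share of all occurrences)."""
    total = sum(form_counts.values())
    return {form: n / total for form, n in form_counts.items()}


profile = grammatical_profile(counts)
for form, share in profile.items():
    # Print the share as a percentage with one decimal place.
    print(f"{form:<7}{share:6.1%}")
```

The same function applies unchanged to any verb once its inflected forms have been counted in a corpus.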
Grammatical Profiles of Russian Verbs
               Nonpast      Past         Infinitive   Imperative
Imperfective   1,330,016    915,374      482,860      75,717
Perfective     375,170      1,972,287    688,317      111,509
chi-squared = 947756, df = 3, p-value < 2.2e-16
effect size (Cramer’s V) = 0.399 (medium-large)
[Figure: Distribution of Russian verb forms according to subparadigm. Bar chart of the percentage of imperfective vs. perfective forms in Nonpast, Past, Infinitive, Imperative.]
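The chi-square statistic and Cramer's V reported for this table can be recomputed by hand. The sketch below is a plain-Python illustration, not the authors' published code; it assumes the eight counts are read row by row (imperfective first, then perfective) across Nonpast, Past, Infinitive, Imperative. The helper names chi_square and cramers_v are hypothetical.

```python
from math import sqrt


def chi_square(table):
    """Pearson chi-square for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of aspect and subparadigm.
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df, n


def cramers_v(stat, n, table):
    """Effect size: sqrt(chi2 / (N * (min(rows, cols) - 1)))."""
    k = min(len(table), len(table[0])) - 1
    return sqrt(stat / (n * k))


# Assumed cell assignment: columns are Nonpast, Past, Infinitive, Imperative.
verb_forms = [
    [1_330_016, 915_374, 482_860, 75_717],   # imperfective
    [375_170, 1_972_287, 688_317, 111_509],  # perfective
]

stat, df, n = chi_square(verb_forms)
v = cramers_v(stat, n, verb_forms)
print(f"chi-square = {stat:.0f}, df = {df}, N = {n}, Cramer's V = {v:.3f}")
```

Under this reading of the table the result should land close to the reported chi-squared value and effect size, which supports the assumed cell assignment. With a sample of roughly 6M forms almost any difference is statistically significant, which is why the effect size carries the argument.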
Prefixation (dark) vs. suffixation (light):
Statistically significant, BUT effect sizes too
small (0.076 & 0.037)
Distribution of Russian verbs according to subparadigm:
Imperfective verbs and their attraction to imperative
Over 200 outliers
Imperfective imperative “be doing X!”
• Polite: guest knows what to expect: razdevajtes’ ‘take
off your coat’, sadites’ ‘sit down’
• Insistence: hearer is hesitant: stupajte ‘get going’,
gljadite ‘look’, zabirajte ‘take’
• Insistence: hearer has not behaved properly
(connection with negation): provalivaj ‘get out of here’,
končaj ‘stop’, ne perebivaj ‘don’t interrupt’
• Polite requests: vyručajte ‘help’
• Kind wishes: vyzdoravlivajte ‘get well’
• Idiomatic: davajte posmotrim ‘let’s take a look’
• Idiomatic/culturally anchored: proščaj(te) ‘farewell’,
soedinjajtes’ ‘unite’ (slogan), zapevaj ‘sing’ (army)
Grammatical Profiles: Findings
• Perfective verbs behave differently than imperfective
verbs
• “Verb pairs” behave the same regardless of which type
of morphology (prefixation vs. suffixation) is used to
mark aspect
• We can identify exactly the verbs that are most attracted
to various TAM combinations.
Grammatical Profiles: Portability
• Across issues:
– Grammatical profiling and gender stereotypes (Kuznetsova 2012)
• Across languages:
– Gives 96% resolution of perfective vs. imperfective for Old Church
Slavonic verbs, as compared with Dostál 1954 (Eckhoff & Janda 2013)
– Planned study of grammatical profiles across 4 languages:
Language     Morphological Aspect   Morphological Aktionsart
Russian      +                      +
Czech        +                      -
N. Saami     -                      +
Norwegian    -                      -
• Across researchers:
– All outlier verbs listed in Janda & Lyashevskaya 2011; data and code for
Eckhoff & Janda 2013 on website
Grammatical Profiles:
Multipurpose Applications
Pedagogical implications:
• Strategic combinations of verbs
and subparadigms
Semantic Profiles: “Empty” prefixes in Russian
Janda, L. A. & Lyashevskaya, O. 2013. “Semantic Profiles
of Five Russian Prefixes: po-, s-, za-, na-, pro-”. Journal of
Slavic Linguistics 21:2, 211-258.
Semantic Profiles: “Empty” prefixes in Russian
Big Questions:
What is the relationship between form and meaning?
➜ ...between prefixes and meanings of verbs?
Are there any “empty” forms?
➜ Are prefixes empty as claimed?
Imperfective base       Prefixed perfective
sovetovat’ ‘advise’     posovetovat’ ‘advise’
varit’ ‘cook’           svarit’ ‘cook’
pisat’ ‘write’          napisat’ ‘write’
tverdet’ ‘harden’       zatverdet’ ‘harden’
gremet’ ‘thunder’       progremet’ ‘thunder’
Semantic Profiles: “Empty” prefixes in Russian
Theoretical focus:
Can we measure the relationship between prefixes and
meanings of verbs?
➜ Distribution of prefixes vs. semantic groups of verbs
How do we show that “empty” forms aren’t really empty?
➜ Show that prefixes have different semantic behaviors
Semantic Profiles: “Empty” prefixes in Russian
Operationalization:
Semantic profiling: relationship between meanings (semantic
tags) and forms
➜Distribution of Russian verb prefixes vs. semantic tags
Data:
382 verbs with “empty” prefixes from the Exploring
Emptiness database (http://emptyprefixes.uit.no/index.php ),
semantic tags independently assigned in the Russian National
Corpus (http://ruscorpora.ru/ )
Statistics:
Chi-square, Cramer’s V effect size, Fisher Test
[Figure: bar chart of the distribution of the prefixes s-, po-, na-, za-, pro- across the semantic classes IMPACT, CHANGEST, BEHAV, S&S]
chi-square = 248, df = 12, p = 2.2e-16; Cramer’s V effect-size = 0.8
Attractions and repulsions measured by Fisher Test

Attractions
Combination       Fisher Test p-value
pro-/S&S          [+]5.7e-25
po-/CHANGEST      [+]1.3e-18
za-/IMPACT        [+]1.5e-15
s-/BEHAV          [+]2.1e-8
na-/IMPACT        [+]5.3e-7
na-/BEHAV         [+]5.5e-5
po-/S&S           [+]0.0008
za-/CHANGEST      [+]0.01

Neutral
Combination       Fisher Test p-value
s-/CHANGEST       [-]0.3
pro-/IMPACT       [-]0.1
s-/S&S            [-]0.1
na-/S&S           [-]0.1
po-/BEHAV         [-]0.05
s-/IMPACT         [+]0.015

Repulsions
Combination       Fisher Test p-value
za-/S&S           [-]2.0e-6
po-/IMPACT        [-]0.0002
pro-/BEHAV        [-]0.0004
na-/CHANGEST      [-]0.001
za-/BEHAV         [-]0.002
pro-/CHANGEST     [-]0.002

Attractions, neutral relationships, and repulsions between prefixes and semantic
classes
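An attraction or repulsion between one prefix and one semantic class can be tested on a 2x2 table (this prefix vs. all other prefixes, this class vs. all other classes). The sketch below is a plain-Python one-sided Fisher exact test built on the hypergeometric distribution. The function name fisher_one_sided is hypothetical, and the counts in the example are invented for illustration; they are not taken from the Exploring Emptiness database.

```python
from math import comb


def fisher_one_sided(a, b, c, d):
    """One-sided Fisher exact tests for the 2x2 table [[a, b], [c, d]].

    a = verbs with the prefix in the class, b = with the prefix, other classes,
    c = other prefixes in the class,        d = other prefixes, other classes.
    Returns (p_attraction, p_repulsion).
    """
    row1, col1, n = a + b, a + c, a + b + c + d
    lowest = max(0, row1 + col1 - n)   # smallest possible value of cell a
    highest = min(row1, col1)          # largest possible value of cell a
    denom = comb(n, col1)

    def prob(k):
        # Hypergeometric probability of observing k in cell a.
        return comb(row1, k) * comb(n - row1, col1 - k) / denom

    p_attraction = sum(prob(k) for k in range(a, highest + 1))
    p_repulsion = sum(prob(k) for k in range(lowest, a + 1))
    return p_attraction, p_repulsion


# Hypothetical counts: 20 of 25 verbs with some prefix fall in one class,
# against 30 of the remaining 357 verbs (382 verbs in total).
p_attr, p_rep = fisher_one_sided(20, 5, 30, 327)
print(f"attraction p = {p_attr:.3g}, repulsion p = {p_rep:.3g}")
```

A small p_attraction corresponds to a [+] entry in the tables above, and a small p_repulsion to a [-] entry.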
Semantic Profiles: Findings
• Each prefix has a unique semantic profile
• Each prefix is attracted to and repulsed by a different set
of semantic classes of verbs
• It is possible to establish meanings of prefixes and
expectations for how prefixes combine with verbs
Semantic Profiles: Portability
All data, statistical code, lists of verbs available at:
http://emptyprefixes.uit.no/semantic_eng.htm
Semantic Profiles: Multipurpose
Applications
Pedagogical implications:
We can design materials that
reduce the burden of memorizing
≈2000 correct prefix-verb
combinations
Constructional Profiles: SADNESS in Russian
Janda, L. A. & Solovyev, V. 2009. “What Constructional
Profiles Reveal About Synonymy: A Case Study of Russian
Words for sadness and happiness”. Cognitive Linguistics
20:2, 367-393.
Crash course in Russian case & SADNESS
Nouns are obligatorily case-marked
6 cases: Nominative, Accusative, Dative, Instrumental,
Genitive, Locative
– All cases can appear with a preposition
– All cases except Locative can also appear without a preposition
– 70 constructions [(preposition) [NOUN]case]
SADNESS: 6 near-synonyms, no “umbrella term”
– grust’, melanxolija, pečal’, toska, unynie, xandra
Constructional Profiles: SADNESS in Russian
Big Questions:
What is the relationship between form and meaning?
➜What is the relationship between words and
grammatical constructions?
➜What is the relationship between synonyms?
Constructional Profiles: SADNESS in Russian
Theoretical focus:
Can we measure the difference between synonyms in
terms of distribution in grammatical constructions?
Constructional Profiles: SADNESS in Russian
Operationalization:
Constructional profiling: relationship between frequency
distribution of grammatical constructions and meaning
➜SADNESS words vs. distribution in [(preposition) [NOUN]case]
constructions
Data: 500 sentences for each word from Russian National Corpus,
Biblioteka Maksima Moškova
Statistics:
Chi-square, Cramer’s V effect size, Hierarchical Clustering
(squared Euclidean distance)
Chi-square = 730.35, df = 30, p < 0.0001, Cramer’s V = 0.305
[Figure: ‘Sadness’ Hierarchical Cluster. Dendrogram with leaves pečal’, toska, xandra, melanxolija, grust’, unynie.]
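Clustering of constructional profiles can be sketched in a few lines: turn each word's construction counts into relative frequencies, measure squared Euclidean distance between profiles, and repeatedly merge the two closest clusters. The sketch below uses average linkage, which is an assumption (the slides name only the distance measure), and the four placeholder words and their counts are invented for illustration.

```python
from itertools import combinations


def to_profile(counts):
    """Convert raw construction counts to relative frequencies."""
    total = sum(counts)
    return [c / total for c in counts]


def sq_euclidean(p, q):
    """Squared Euclidean distance between two profiles."""
    return sum((x - y) ** 2 for x, y in zip(p, q))


def cluster(profiles):
    """Average-linkage agglomerative clustering; returns the merges in order."""
    clusters = {name: [name] for name in profiles}  # cluster label -> members
    merges = []

    def linkage(pair):
        # Mean pairwise distance between the members of two clusters.
        a, b = pair
        dists = [sq_euclidean(profiles[x], profiles[y])
                 for x in clusters[a] for y in clusters[b]]
        return sum(dists) / len(dists)

    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=linkage)
        merges.append((a, b, linkage((a, b))))
        clusters[f"({a}+{b})"] = clusters.pop(a) + clusters.pop(b)
    return merges


# Hypothetical counts of four constructions for four placeholder words.
raw = {
    "word_A": [50, 30, 15, 5],
    "word_B": [48, 32, 14, 6],
    "word_C": [10, 20, 60, 10],
    "word_D": [12, 18, 55, 15],
}
profiles = {word: to_profile(counts) for word, counts in raw.items()}
merges = cluster(profiles)
for a, b, dist in merges:
    print(f"merge {a} and {b} at distance {dist:.4f}")
```

The order of the merges is what a dendrogram draws: words merged early have similar constructional profiles, words merged late are farther apart.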
Constructional Profiles: Findings
Each synonym has a unique constructional profile
Some synonyms are closer together, others are
farther apart
Constructional Profiles: Portability
• Across issues:
– Logistic regression analysis of Russian gruzit’ ‘load’ with 3
“empty” prefixes across Locative Alternation constructions
(Sokolova 2012, Sokolova, Janda and Lyashevskaya 2012)
– Analysis of aspectual pairs formed by prefix pro- (Kuznetsova 2012)
• Across languages:
– North Saami anaphoric possessive constructions: reflexive
pronoun vs. possessive suffix (forthcoming)
• Data published in Janda & Solovyev article; data and code for
gruzit’ on website.
Constructional Profiles:
Multipurpose Applications
Pedagogical implications:
Teach relevant constructions with near-synonyms
Possible implication for machine translation:
Lexical selection informed by constructional profiles
Radial Category Profiles: Ambipositions in
North Saami
Antonsen, L., Janda, L. A. & Baal, B. A. B. 2012. “Njealji davvisámi
adposišuvnna geavahus” [“The Use of Four North Saami Adpositions”].
Sámi dieđalaš áigečála 2012, v. 2. 32 pp.
Janda, L. A., Antonsen, L. & Baal, B. A. B. Forthcoming. “A Radial
Category Profiling Analysis of North Sámi Ambipositions”. High Desert
Linguistics Society Proceedings, Volume 1. 11 pp.
Crash course in
North Saami ambipositions
Unusually large number of adpositions that can appear as both
prepositions and postpositions, always use Genitive case
1. a. miehtá dálvvi [over winter-G]
   b. dálvvi miehtá [winter-G over]
   ‘during the winter’
2. a. čađa áiggi [through time-G]
   b. áiggi čađa [time-G through]
   ‘through time’
3. a. rastá joga [across river-G]
   b. joga rastá [river-G across]
   ‘across the river’
4. a. maŋŋel soađi [after war-G]
   b. soađi maŋŋel [war-G after]
   ‘after the war’
Radial Category Profiles:
North Saami ambipositions
Big Questions:
What is the relationship between form and meaning?
➜What is the relationship between position (preposition
vs. postposition) and meaning?
What is the influence of majority languages (prepositional
languages in West vs. postpositional languages in East)?
Is there a relationship between frequency of ambipositions
and their use to distinguish meaning?
Radial Category Profiles:
North Saami ambipositions
Theoretical focus:
Can we measure the difference between uses in
preposition vs. postposition?
Can we model the meanings in terms of a radial category?
Can we measure dialectal differences?
Radial Category Profiles:
North Saami ambipositions
Operationalization:
Radial category profiling: differences in the frequency distribution
of uses across two or more near-synonyms
➜Distribution across uses in radial category for preposition vs.
postposition
Data: 100+ sentences for each position from 10M word newspaper
corpus, plus exx. from literature, Bible translation
Statistics:
Chi-square, Cramer’s V effect size
Radial categories:
miehtá ‘over’ in newspapers
preposition: time 9%, extent 79%, motion 12%
postposition: time 95%, extent 5%
chi-square = 170, df = 2, p < 2.2e-16; Cramer’s V = 0.85
[Figure: Distribution of adpositions. Percentage of prepositional (% PR) vs. postpositional (% PO) uses in S. Troms, Kautokeino, Tana, NT, and Newspapers.]
chi-square = 129.7, df = 2, p < 2.2e-16; Cramer’s V = 0.48
Radial Category Profiles: Findings
There is a relationship between meaning and position
Prevailing trends in majority languages do influence use of
position
There seems to be a typological relationship between
frequency of ambipositions and their use to distinguish
meaning
Languages with few ambipositions (Germanic, Russian)
do not use position distinctively
Languages with more ambipositions use them in more
complex ways (North Saami > Finnish, Estonian)
Radial Category Profiles: Portability
• Across issues and languages:
– Russian prefixes vy- vs. iz- (Nesset, Endresen, Janda
2011)
– Russian prefixes o-/ob-/obo- (Baydimirova [Endresen]
2010)
• Data and code published on website.
Radial Category Profiles:
Multipurpose Applications
Pedagogical implications:
Teach ambipositions with relevant meanings and nouns
Improvements to constraint grammar analyzer:
Improves linguistic analysis and language technology
tools, which are crucial to preserving and revitalizing the
language
7. Infrastructure
Data management issues:
Remember those problems with portability?
--Data analyzed in proprietary programs
--Data not publicly available or hard to navigate
http://www.youtube.com/watch?v=N2zK3sAtr-4
TROLLing
Tromsø Repository of Language and
Linguistics
•International archive of data and code
•All items open-source, open access
•Searchable metadata
•Verify results, see how to implement various
statistical models
•Housed at UiT library
•Connected to CLARIN (Common Language
Resources and Technology Infrastructure, a
networked federation of European data
repositories)
8. Applications
If we have a finding that is connected to a Big Question and
is statistically robust, it should also be
Language technology can implement useful results in:
--Disambiguation, Parsing
These feed into:
--Pedagogical applications
--Machine translation
--Corpus analysis tools
--Language revitalization
--Language proofing tools
A model for applications:
http://giellatekno.uit.no/english.html