Lexicon-Grammar of Russian Verbal Idioms

Tetyana Fukova
University of Algarve - FCHS (Portugal)
Supervisors: Jorge Baptista, University of Algarve – FCHS
and INESC-ID Lisboa - Spoken Language Lab (Portugal)
Svitlana Chornobay, Crimean Federal University (Crimea)
idioms
verbal idioms (frozen sentences):
Frozen sentences are elementary sentences where the main verb and at
least one of its arguments are distributionally invariable; usually, the
global meaning of the expression cannot be calculated from the
individual meanings of its component elements when they are used
independently:
держать язык за зубами
(derzhat’ jazyk za zubami)
hold/V one’s tongue/C-acc behind/Prep one’s/N-gen teeth/C-acc
‘keep one's tongue between one's teeth’
objectives
• to determine the relevant linguistic information required
to identify Russian verbal idioms in texts
• to formalize that information into a database of idioms
• to build a library of finite-state transducers (FST) to process
those idioms in real texts
• to evaluate the performance of the FST library
linguistic resources and tools
o available linguistic resources
• phraseological dictionaries
(Molotkov, 1986; Fedosov and Lapisky, 2003)
• the machine-readable dictionary distributed with
Unitex
o linguistic development platform UNITEX
(Paumier 2003, 2014)
methods
data collection
• about 1,000 verbal idioms collected
from phraseological dictionaries (current, frequent)
• classified using the Lexicon-Grammar framework
(M. Gross 1982, 1996)
• formalized in tabular format, forming a database
aimed at computational processing of texts
• fine-grained description of the idiomatic expressions:
- syntactic structure
- lexical content of frozen elements,
- distributional constraints on free syntactic slots (± human)
- transformational properties (Passive, permutation)
methods
data collection
• each idiom:
- entry (verb in infinitive)
- word-by-word translation
with relevant morphosyntactic information (e.g. case)
- free translation (gloss) or English equivalent
- illustrative example
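The entry layout described above can be pictured as one record per idiom. The following is only a minimal sketch: the class name `IdiomEntry` and its field names are hypothetical and are not the column names of the actual lexicon-grammar tables. The sample values are those given for бить баклуши in the classification; properties the slides do not state for this idiom are left unset.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class IdiomEntry:
    """One row of a lexicon-grammar table (hypothetical field names)."""
    entry: str                              # idiom, verb in the infinitive
    idiom_class: str                        # e.g. "C1"
    structure: str                          # structural formula
    word_by_word: str                       # literal translation with case marks
    gloss: str                              # free translation or English equivalent
    example: str = ""                       # illustrative sentence, if available
    human_subject: Optional[bool] = None    # ± human constraint on N0
    passive: Optional[bool] = None          # accepts Passive?
    permutation: Optional[bool] = None      # accepts permutation?


bit_baklushi = IdiomEntry(
    entry="бить баклуши",
    idiom_class="C1",
    structure="N0 V C-acc1",
    word_by_word="N0 beat/V spoons/C1-acc",
    gloss="to twiddle one's thumbs, to be idle",
)

# A flat dict like this maps directly onto one line of a tabular file.
row = asdict(bit_baklushi)
```

Keeping unknown properties as `None` rather than `False` separates "not yet described" from "property rejected", which matters while the database is still being extended.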
methods
classification
• inspired by M. Gross's (1982, 1996) proposal for French idioms
• already adapted for other languages (including non-Latin languages):
- French (the four main varieties):
France, Belgium, Switzerland and Québec
(Lamiroy 2010)
- Greek (Fotopoulou 1997)
- Italian (Vietry 2015)
- Portuguese, both European and Brazilian
(Baptista et al. 2004, 2014; Vale 2001)
methods
classification
• M. Gross (1982) original classification proposal, based on:
- number of free (N) and frozen (C) slots
- their position as subject, first, or second complement
- the prepositions introducing the complements / case
- transformational properties
(e.g. Passive, permutation, etc.)
• adapted to Russian in order to encompass CASE
Classification of Russian verbal idioms

Class: C1
Structure: N0 V C-acc1
Example: Бить баклуши (bit’ baklushi)
N0 beat/V spoons/C1-acc
‘to twiddle one's thumbs, to be idle’
Count: 245

Class: CP1
Structure: N0 V (Prep1) C1
Example: Влететь в копеечку (vletet’ v kopeechku)
N0 fly/V in/Prep penny/C1-acc
‘to cost smb. a pretty penny’
Count: 298

Class: CAN
Structure: N0 V (C-acc N-gen)1 = N0 V (C-acc1 N-dat2)
Example: Заговаривать зубы (zagovarivat’ zubi)
N0 talk/V teeth/C1-acc smb/N-dat|gen
‘distract the interlocutor by talking about extraneous matters’
Count: 28

Class: CPN
Structure: N0 V Prep (C-acc N-gen)1
Example: Играть на нервах (igrat’ na nervah)
N0 play/V on/Prep nerves/C1-obliq N-gen
‘to jangle on someone's ears/nerves’
Count: 28

Class: C1PN
Structure: N0 V C-acc1 (Prep2) N2
Example: Задать пару (zadat’ paru)
N0 set/V steam/C1-acc smb/N2-dat
‘to give smb. hell’
Count: 80

Class: CNP2
Structure: N0 V N-acc1 (Prep2) C2
Example: Взять под крыло (vzyat’ pod krilo)
N0 take/V smb/N1-acc under/Prep wing/C2-acc
‘to take smb. under one's wing’
Count: 187

Class: C1P2
Structure: N0 V C-acc1 (Prep2) C2
Example: Брать быка за рога (brat’ bika za roga)
N0 take/V bull/C1-acc of/Prep horns/C2-acc
‘to take the bull by the horns’
Count: 98

Class: CPP
Structure: N0 V w (Prep1) C1 (Prep2) C2
Example: Лезть в душу без мыла (lezt’ v dushu bez mila)
N0 get/V into/Prep soul/C1-acc without/Prep soap/C2-gen
‘to try to gain smb.'s favor or trust by cunning’
Count: 15

Class: CADV
Structure: N0 V Adv1 w
Example: Выходить боком (vihodit’ bokom)
N0 appear/V sideways/Adv
‘to turn out badly’
Count: 23

Total: 1,002
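Since the classes are defined by the number of free (N) and frozen (C) slots, the structural formulas can be read mechanically. The sketch below is an illustration only, not the authors' tooling: the function name `count_slots` is hypothetical, and it assumes that every free slot is written `N` and every frozen slot `C`, optionally followed by a case mark and an index, as in the formulas of the classification.

```python
import re

# A slot symbol: letters, an optional "-case" mark, an optional index,
# e.g. "N0", "C-acc1", "Prep2", "V".
SLOT = re.compile(r"[A-Za-z]+(?:-[A-Za-z]+)?\d*")


def count_slots(structure: str) -> tuple[int, int]:
    """Return (free, frozen) slot counts for a structural formula."""
    free = frozen = 0
    for token in SLOT.findall(structure):
        # Drop the index and the case mark to get the bare symbol.
        symbol = token.rstrip("0123456789").split("-")[0]
        if symbol == "N":
            free += 1
        elif symbol == "C":
            frozen += 1
    return free, frozen


c1pn = count_slots("N0 V C-acc1 (Prep2) N2")   # class C1PN
c1p2 = count_slots("N0 V C-Acc1 (Prep2) C2")   # class C1P2
```

Here `c1pn` counts the subject N0 and the complement N2 as free and C-acc1 as frozen, while `c1p2` has a single free slot (the subject) and two frozen complements.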
methods
Corpus collection and annotation
• Russian National Corpus (www.ruscorpora.ru)
• 10 most frequent verbs from the lexicon-grammar:
- держать (derzhat’) ‘to hold’, идти (idti) ‘to go’, играть (igrat’)
‘to play’, бить (bit’) ‘to beat’, смотреть (smotret’) ‘to look’,
класть (klast’) ‘to put’, лезть (lezt’) ‘to climb’, лежать (lezhat’)
‘to lie’, выйти (viiti) ‘to go out’, жить (zhit’) ‘to live’
- excluding verbs that are often support verbs (M. Gross 1996):
брать (brat’) ‘to take’, давать (davat’) ‘to give’, and
делать (delat’) ‘to do’
methods
Corpus collection and annotation
• top search results for each verb lemma (and inflected forms)
• random selection of 50 sentences → Corpus 1
• manual annotation of the idioms found (_idiom_)
• goal: to get a glimpse of the degree of completeness
of the lexicon-grammar built so far

• top search results for each verb lemma + constant
(frozen head noun), allowing a window of up to 3 words
• random selection of 50 sentences → Corpus 2
• manual annotation of the idioms found (_idiom_) and of literal uses (#literal#)
• goal: to evaluate the adequacy of the FST approach
to the task of identifying verbal idioms in texts
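The verb + constant query used for Corpus 2 can be approximated as a window search over tokens. This is a rough sketch under stated assumptions, not the corpus interface actually used: the function `find_candidates` is hypothetical, the window is taken to mean at most 3 intervening words in either direction, verbs are assumed to be lemmatised beforehand, and the frozen noun is matched in the form it has in the idiom.

```python
def find_candidates(tokens: list[str], verb: str, constant: str,
                    window: int = 3) -> list[tuple[int, int]]:
    """Return (verb_index, constant_index) pairs with at most
    `window` tokens between the verb and the frozen noun."""
    hits = []
    for i, token in enumerate(tokens):
        if token != verb:
            continue
        # |i - j| - 1 intervening tokens must not exceed the window.
        lo = max(0, i - window - 1)
        hi = min(len(tokens), i + window + 2)
        for j in range(lo, hi):
            if j != i and tokens[j] == constant:
                hits.append((i, j))
    return hits


adjacent = find_candidates(["он", "бить", "баклуши"], "бить", "баклуши")
too_far = find_candidates(["бить", "a", "b", "c", "d", "баклуши"],
                          "бить", "баклуши")
```

A hit is only a candidate: the same verb and noun can co-occur in a literal use, which is why the retrieved sentences still need manual annotation.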
Corpus 1: data collection (from Russian National Corpus).

Verb      Translit   Gloss      RNC        in sample  diff.    diff. LG  Total LG
                                           (n=50)     idioms   entries   entries w/ V
держать   derzhat'   to hold       50,643  5          4        4          33
идти      idti       to go        241,225  1          1        1          23
играть    igrat'     to play       66,077  0          0        0          14
бить      bit'       to beat       33,393  6          5        4          12
смотреть  smotret'   to look      157,516  0          0        0          11
класть    klast'     to put        10,458  1          1        1          11
лезть     lezt'      to climb      11,273  3          3        3           9
лежать    lezhat'    to lie        80,235  2          2        2           9
выйти     viiti      to go out    165,311  1          1        1           8
жить      zhit'      to live      187,841  2          2        2           7
Total                           1,003,972  21         19       18        137
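One way to read the Corpus 1 figures as a completeness estimate is to compare, per verb, the distinct idioms found in the sample with those that already have a lexicon-grammar entry. The sketch below assumes that this is what the "diff. idioms" and "diff. LG entries" columns mean; the variable names are illustrative.

```python
# (verb, distinct idioms found in the sample, of which already in the LG),
# copied from the Corpus 1 table.
rows = [
    ("держать", 4, 4), ("идти", 1, 1), ("играть", 0, 0), ("бить", 5, 4),
    ("смотреть", 0, 0), ("класть", 1, 1), ("лезть", 3, 3), ("лежать", 2, 2),
    ("выйти", 1, 1), ("жить", 2, 2),
]

found = sum(n_found for _, n_found, _ in rows)
in_lg = sum(n_lg for _, _, n_lg in rows)
coverage = in_lg / found

# Verbs for which at least one idiom in the sample has no LG entry yet.
verbs_with_gaps = [verb for verb, n_found, n_lg in rows if n_lg < n_found]
```

Under this reading, 18 of the 19 distinct idioms in the sample are covered, and the single gap falls under бить.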
Corpus 2: data collection (from Russian National Corpus).

Verb      Translit   Gloss          diff. idioms   matches   idioms
Держать   derzhat'   'hold/keep'    29             1,048     594
Идти      idti       'go'           23               988     490
Играть    igrat'     'play'         13               468     189
Бить      bit'       'beat'         11               501     295
Класть    klast'     'put'           9               229      71
Лезть     lezt'      'climb'         5               225     154
Лежать    lezhat'    'lie'           6               253     152
Выйти     viiti      'go out'        6               229     151
Жить      zhit'      'live'          6               193      80
Total                               117            4,430   2,334
methods
building reference graphs (Class C1)
methods
building reference graphs (Class CP1)
methods
building reference graphs (Class C1P2)
methods
resulting graph (C1)
бить баклуши, N0 beat/V spoons/C1-Acc ‘be idle’
Representation of Passive in the lexicon
• The Unitex lexical resource for Russian is a sample dictionary
(Nagel, 2002) built from the vocabulary of Dostoevsky’s novel
Игрок (Igrok) ‘The Gambler’
• Passive voice is coded ‘P’
• the same verb form may have different lemmas or different inflection codes:
бросался,бросать.V+nsv+tr:PeMVi, ‘throw’
бросался,бросаться.V+intr+nsv:AeMVi
• the relative position of the semantic information on transitivity
is not consistent (+nsv+tr vs. +intr+nsv)
• the passive code ‘P’ corresponds, in fact,
not only to a passive construction
but also to an active-reflexive construction
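The two sample entries follow the usual Unitex layout `form,lemma.POS+codes:inflection`. A minimal parsing sketch is given below, assuming no escaped commas or dots inside an entry; the names `DelafEntry`, `parse_delaf` and `has_passive_code` are hypothetical. Treating the first character of the inflection code as the voice mark is also an assumption, drawn only from the two entries shown.

```python
from dataclasses import dataclass


@dataclass
class DelafEntry:
    form: str
    lemma: str
    pos: str
    features: list[str]      # e.g. ["nsv", "tr"]
    inflections: list[str]   # e.g. ["PeMVi"]


def parse_delaf(line: str) -> DelafEntry:
    """Split one dictionary line into its parts (no escape handling)."""
    form, _, rest = line.strip().partition(",")
    lemma, _, codes = rest.partition(".")
    gram, *inflections = codes.split(":")
    pos, *features = gram.split("+")
    # An empty lemma means the lemma is identical to the form.
    return DelafEntry(form, lemma or form, pos, features, inflections)


def has_passive_code(entry: DelafEntry) -> bool:
    """True if any inflection code starts with 'P' (assumed voice mark)."""
    return any(code.startswith("P") for code in entry.inflections)


passive_reading = parse_delaf("бросался,бросать.V+nsv+tr:PeMVi")
reflexive_lemma = parse_delaf("бросался,бросаться.V+intr+nsv:AeMVi")
```

Parsed this way, the inconsistency noted above is easy to see: `passive_reading.features` is `["nsv", "tr"]` while `reflexive_lemma.features` is `["intr", "nsv"]`, so the transitivity code cannot be found by position.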
Representation of Passive in the lexicon
• we made the dictionary notation formally consistent
• we established a clear distinction, whenever possible,
between the ‘P’=passive and ‘P’=reflexive values of the
suffix -ся/-сь (sya/s’)
• we adapted the dictionary to produce revised lexical resources:
- the dictionary of the text, excluding all verb forms
(same as original);
- the dictionary of verbs without the -ся/-сь (sya/s’) suffix;
- the dictionary of forms ending in -ся/-сь (sya/s’)
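The three-way split described above could be sketched as follows. This is an illustration under assumptions, not the procedure actually applied: the function `split_dictionary` is hypothetical, it assumes the `form,lemma.POS...` line layout, and it assumes that only verb forms are routed by their ending, so a noun that happens to end in -сь stays in the first dictionary. In the sample, only the first line comes from the slides; the other two are invented for illustration.

```python
def split_dictionary(lines):
    """Return (non_verbs, verbs_without_sya, forms_in_sya)."""
    non_verbs, plain_verbs, sya_forms = [], [], []
    for line in lines:
        form, _, rest = line.partition(",")
        _, _, codes = rest.partition(".")
        pos = codes.split(":")[0].split("+")[0]
        if pos != "V":
            non_verbs.append(line)
        elif form.endswith(("ся", "сь")):
            sya_forms.append(line)
        else:
            plain_verbs.append(line)
    return non_verbs, plain_verbs, sya_forms


sample = [
    "бросался,бросать.V+nsv+tr:PeMVi",   # from the slides
    "бросал,бросать.V+nsv+tr:AeMVi",     # invented entry
    "лось,лось.N",                       # invented entry: noun ending in -сь
]
non_verbs, plain_verbs, sya_forms = split_dictionary(sample)
```

Checking the part of speech before the ending is the design choice that keeps nouns such as лось out of the -ся/-сь dictionary.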
Evaluation

Corpus    Class   TP     FP    FN   Precision  Recall  F-measure
Corpus 1  C1        37     0    8   1.00       0.82    0.90
          CP1       14     0    0   1.00       1.00    1.00
          C1P2       7     0    0   1.00       1.00    1.00
          Total     58     0    8   1.00       0.88    0.94
Corpus 2  C1       251   100   10   0.72       0.96    0.82
          CP1      752   105    9   0.88       0.99    0.93
          C1P2     172    28    3   0.86       0.98    0.92
          Total   1175   233   22   0.83       0.98    0.90
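The scores in the evaluation follow from the TP, FP and FN counts under the standard definitions: precision = TP / (TP + FP), recall = TP / (TP + FN), and F-measure as their harmonic mean. A small sketch (the function name `prf` is illustrative) that recomputes the two Total rows:

```python
def prf(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Precision, recall and balanced F-measure from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Total rows of the evaluation table, rounded to two decimals.
corpus1_total = tuple(round(x, 2) for x in prf(58, 0, 8))
corpus2_total = tuple(round(x, 2) for x in prf(1175, 233, 22))
```

Rounded to two decimals, these give (1.0, 0.88, 0.94) and (0.83, 0.98, 0.9), matching the reported totals.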
Conclusions
We have presented a project to build a database
of Russian verbal idioms. We have:
• collected more than 1,000 entries from dictionaries (ongoing)
• built a Lexicon-Grammar for those idioms
(with morphosyntactic information and examples)
• adopted M. Gross's (1982) formal classification
(our contribution was to adapt it to a typologically
distinct language)
• produced a detailed description of each class
and provided examples for each idiom (no such resource exists yet for Russian)
• built reference graphs for the largest classes
(C1, CP1, C1P2)
• improved the base dictionary provided with Unitex
• run two experiments on two corpora
(aimed at estimating LG coverage and FST precision)
Future work
• Extend the lexical coverage of the lexicon-grammar
• Build FSTs for the remaining classes in the LG
• Address free-order syntax of sentential constituents
in Russian
• Address incompleteness and technical details
of the base dictionary, distributed with Unitex
• Describe idioms corresponding
to support verb constructions
• Include verbal idioms with frozen subject (C0x)
• Signal the ambiguity between idiomatic and literal
meaning of the idioms
Final words
This work can be considered a first attempt at the automatic identification
and detection of Russian verbal idioms. Much is still to be done.
The following publications have been produced during the course of this
project:
FUKOVA, T., CHORNOBAY, S., BAPTISTA, J. 2016. Lexicon-Grammar of
Russian verbal idioms. Computerised and Corpus-based Approaches to
Phraseology: Monolingual and Multilingual Perspectives. Proceedings from
Europhras 2015, Malaga, Spain (June 30, 2015). Tradulex: Geneva, pp. 139-153.
FUKOVA, T., CHORNOBAY, S., BAPTISTA, J. (to appear). Classification of
Russian verbal idioms. Paper presented at the Web Conference: International
scientific congress «Foreign Philology. Social and national variability of
language and literature», Crimean Federal V.I. Vernadsky University
(April 27, 2016).
Obrigada
Thank you!
Спасибо