Transcript Slide 1

Verb Valency Frame Extraction Using Morphological
and Syntactic Features of Croatian
Krešimir Šojat, Željko Agić, Marko Tadić
Department of Linguistics, Department of Information Sciences
Faculty of Humanities and Social Sciences, University of Zagreb
{ksojat, zagic, marko.tadic}@ffzg.hr
FASSBL 7 Conference
Dubrovnik, Croatia
2010-10-05
Overview
 What?
 extraction and semi-automatic construction of verb
valency frames
 How?
 rule-based extraction procedure run on the Croatian
dependency treebank
 manual assignment of tectogrammatical functors
 inference of rules for assigning functors to unseen
text
 Why?
 creation of treebank-based verb valency lexicon
 enhancement and enrichment of existing resources
Valency frames
 valency frame extraction means detecting all
possible environments of a particular verb as found in
the treebank
 such an approach aims at fast construction of
valency frames
 extraction is automatic; no frame elements are
added manually by human annotators
 such an automatically acquired verb valency lexicon
can serve as a basis for further enrichment and
enhancement of manually constructed resources,
either existing or constructed from scratch
The treebank
 Croatian Dependency Treebank (HOBS)
 follows the guidelines of the Prague Dependency Treebank (PDT)
 taken from the Croatia Weekly 100 kw sub-corpus of
the Croatian National Corpus (HNK)
 XCES-encoded up to the word level
 sentence-delimited, tokenized, manually lemmatized
and MSD-tagged
 serves as the morphological layer of the treebank
 annotated on the syntactic layer
 approximately 2,700 sentences, 67,000 tokens
 manually assigned syntactic functions
 ca. 1,300 sentences double-checked and used in this
experiment
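The MSD tags that appear on the following slides (Ncfsn, Np-sn, Ncmsg and so on) are positional. A minimal sketch of how a noun tag can be unpacked, assuming the MULTEXT-East noun layout (category, type, gender, number, case). The function name and lookup tables are illustrative and are not part of the treebank tooling:

```python
# Illustrative decoder for noun MSD tags.
# Assumed layout: N + type + gender + number + case (MULTEXT-East style).
NOUN_TYPE = {"c": "common", "p": "proper"}
GENDER = {"m": "masculine", "f": "feminine", "n": "neuter"}
NUMBER = {"s": "singular", "p": "plural"}
CASE = {"n": "nominative", "g": "genitive", "d": "dative", "a": "accusative",
        "v": "vocative", "l": "locative", "i": "instrumental"}


def decode_noun_msd(msd):
    """Return the features of a noun tag; an unspecified '-' maps to None."""
    if not msd.startswith("N") or len(msd) < 5:
        raise ValueError(f"not a noun MSD: {msd!r}")
    tables = (("type", NOUN_TYPE), ("gender", GENDER),
              ("number", NUMBER), ("case", CASE))
    return {name: table.get(code)
            for (name, table), code in zip(tables, msd[1:5])}


print(decode_noun_msd("Ncfsn"))  # tag of "Unija" in the example sentence
print(decode_noun_msd("Np-sn"))  # gender left unspecified in the tag
```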
The treebank
HR: Unija je već dogovorila neke mjere kako bi pomogla Hrvatskoj.
EN: The Union has already arranged some measures in order to help Croatia.
Extraction algorithm
 the algorithm aims at extraction of verb valency
frame instances
 for each verb in the treebank sample, it descends
 one level down the dependency tree to retrieve
subjects (Sb), objects (Obj), adverbs (Adv) and
nominal predicates (Pnom)
 two levels down to retrieve tokens with those same
functions when they are introduced by subordinating
conjunctions (AuxC) or prepositions (AuxP)
Extraction algorithm
 algorithm illustration
dogovorila (dogovoriti Pred)
[Unija Ncfsn Sb] [mjere Ncfpa Obj] [već Rt Adv] [kako Css AuxC]
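The descending rules can be sketched as follows. This is a minimal illustration and not the authors' implementation: the `Node` class, the `extract_frame` name and the `->` output notation are assumptions, and the sample data is the subtree of the verb "imati" from the example given later in these slides.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical treebank node: word form, MSD tag, analytical function."""
    form: str
    msd: str
    afun: str
    children: List["Node"] = field(default_factory=list)


DIRECT = {"Sb", "Obj", "Adv", "Pnom"}  # retrieved one level down
LINKS = {"AuxC", "AuxP"}               # conjunctions and prepositions


def extract_frame(verb):
    """Collect frame elements one level down, or two levels via AuxC/AuxP."""
    frame = []
    for child in verb.children:
        if child.afun in DIRECT:
            frame.append((child.form, child.msd, child.afun))
        elif child.afun in LINKS:
            for grandchild in child.children:
                if grandchild.afun in DIRECT:
                    frame.append((f"{child.form}->{grandchild.form}",
                                  f"{child.msd}->{grandchild.msd}",
                                  f"{child.afun}->{grandchild.afun}"))
    return frame


imati = Node("imati", "Vmn", "Obj", [
    Node("stanovništvo", "Ncnsn", "Sb"),
    Node("korist", "Ncfsa", "Obj"),
    Node("od", "Spsg", "AuxP", [Node("projekta", "Ncmsg", "Adv")]),
    Node("kroz", "Spsa", "AuxP", [Node("ekoturizam", "Ncmsa", "Adv")]),
])

for element in extract_frame(imati):
    print(element)
```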
Extraction algorithm
 the first version retrieved predicates only and was
expanded to retrieve all the verbs from the treebank
sample
 algorithm adapted to retrieve any verbs found in the
dependency structure, regardless of their respective
analytical functions and position within the dependency
trees
 the adaptation itself is implemented in order to raise the
recall of the algorithm, while still maintaining its
precision by not changing the simple set of descending
rules
 i.e. to retrieve as many verbs as possible given the
limited size of the treebank sample used in the
experiment
Extraction algorithm
 the verb “imati” (Vmn) is annotated as object (Obj)
Extraction algorithm
 thus, the number of frames extracted from each
sentence corresponds to the number of verbs:
 one frame for the main clause that captures the
whole syntactic structure of the sentence
 frames extracted from dependent clauses
naglasio (naglasiti Vmps-sma Pred)
[Mikuška Np-sn Sb] [kako->imati Css AuxC->Obj]
imati (imati Vmn Obj)
[stanovništvo Ncnsn Sb] [korist Ncfsa Obj]
[od->projekta Spsg->Ncmsg AuxP->Adv]
[kroz->ekoturizam Spsa->Ncmsa AuxP->Adv]
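Retrieving every verb regardless of its analytical function amounts to a full traversal of the dependency tree. A minimal sketch, assuming nodes are nested dictionaries and that a verb is any token whose MSD starts with "V"; both the representation and the `iter_verbs` name are illustrative. The tree below contains only the tokens shown in the example above, not the full sentence.

```python
def iter_verbs(node):
    """Yield every verb node in the subtree, whatever its analytical function."""
    if node["msd"].startswith("V"):
        yield node
    for child in node.get("children", []):
        yield from iter_verbs(child)


# Reduced tree for the example: only the tokens listed on the slide.
tree = {
    "form": "naglasio", "msd": "Vmps-sma", "afun": "Pred", "children": [
        {"form": "Mikuška", "msd": "Np-sn", "afun": "Sb"},
        {"form": "kako", "msd": "Css", "afun": "AuxC", "children": [
            {"form": "imati", "msd": "Vmn", "afun": "Obj", "children": [
                {"form": "stanovništvo", "msd": "Ncnsn", "afun": "Sb"},
                {"form": "korist", "msd": "Ncfsa", "afun": "Obj"},
            ]},
        ]},
    ],
}

verbs = [(v["form"], v["afun"]) for v in iter_verbs(tree)]
print(verbs)  # one frame would be extracted per entry
```

The predicate-only first version would correspond to keeping just the nodes whose analytical function is Pred.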
Functor assignment
 In order to annotate verbal frames we used a set of
5 argument functors and 32 free-modification
functors:
 Argument functors: ACT, PAT, ADDR, ORIG, EFF
 Temporal functors: TWHEN, TFHL, TFRWH, THL, THO,
TOWH, TPAR, TSIN, TTILL
 Locative and directional functors: DIR1, DIR2, DIR3, LOC
 Functors for causal relations: AIM, CAUS, CNCS, COND, INTT
 Functors for expressing manner: ACMP, CPR, CRIT, DIFF,
EXT, MANN, MEANS, REG, RESL, RESTR
 Functors for specific modifications: BEN, CONTRD, HER,
SUBS
 936 frame instances were manually annotated for 424
different verbs
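The functor inventory above can be held in a small lookup structure, which also makes the 5 + 32 split easy to check. This is only an illustrative encoding: the group keys and the `functor_group` helper are assumptions, while the functor labels are those listed on the slide.

```python
# Functor inventory as listed on the slide; group names are illustrative.
FUNCTORS = {
    "argument": ["ACT", "PAT", "ADDR", "ORIG", "EFF"],
    "temporal": ["TWHEN", "TFHL", "TFRWH", "THL", "THO",
                 "TOWH", "TPAR", "TSIN", "TTILL"],
    "locative_directional": ["DIR1", "DIR2", "DIR3", "LOC"],
    "causal": ["AIM", "CAUS", "CNCS", "COND", "INTT"],
    "manner": ["ACMP", "CPR", "CRIT", "DIFF", "EXT",
               "MANN", "MEANS", "REG", "RESL", "RESTR"],
    "specific": ["BEN", "CONTRD", "HER", "SUBS"],
}


def functor_group(functor):
    """Return the group a functor belongs to, or None if it is unknown."""
    for group, members in FUNCTORS.items():
        if functor in members:
            return group
    return None


n_arguments = len(FUNCTORS["argument"])
n_free = sum(len(m) for g, m in FUNCTORS.items() if g != "argument")
print(n_arguments, n_free)
print(functor_group("TWHEN"))
```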
Results
 valency frame frequency across verb lemmas
Verb        Frequency
biti        188
imati       23
reći        15
dobiti      12
raditi      10
kazati      9
pokazati    8
postati     8
vidjeti     8
dati        7
raditi (en. to work, to do)
Valency frame        Frequency
ACT PAT              2
ACT CRIT LOC THL     1
ACT MANN TWHEN       1
ACT MEANS TWHEN      1
ACT PAT TSIN         1
dati (en. to give)
Valency frame        Frequency
ACT ADDR PAT         4
ACT ADDDR PAT        1
ACT ADDR AIM PAT     1
ACT PAT              1
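A per-verb frame distribution like the ones above can be obtained by counting normalised functor tuples. A minimal sketch, assuming a frame is identified by its functors in alphabetical order (as the tables suggest). The order of functors inside each sample instance is invented for illustration; the resulting counts reproduce the rows listed for "raditi".

```python
from collections import Counter


def frame_key(functors):
    """Normalise a frame instance to an alphabetically ordered tuple."""
    return tuple(sorted(functors))


# Annotated instances for "raditi"; the functor order within each is made up.
raditi_instances = [
    ["PAT", "ACT"],
    ["ACT", "PAT"],
    ["LOC", "ACT", "THL", "CRIT"],
    ["ACT", "TWHEN", "MANN"],
    ["MEANS", "ACT", "TWHEN"],
    ["ACT", "TSIN", "PAT"],
]

distribution = Counter(frame_key(f) for f in raditi_instances)
for frame, count in distribution.most_common():
    print(" ".join(frame), count)
```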
Results
 frequency of verb valency frames, i.e. n-tuples of
tectogrammatical functors
Frame            Count   Percent
ACT PAT          250     26.71
PAT*             157     16.77
ACT PAT TWHEN    30      3.21
ACT MANN PAT     23      2.46
ACT ADDR PAT     20      2.14
ACT LOC          20      2.14
ACT LOC PAT      20      2.14
MANN PAT         17      1.82
ACT CAUS PAT     16      1.71
ACT MANN         13      1.39
LOC PAT          12      1.28
ADDR PAT         11      1.18
Other            347     37.07
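The percentages are each frame's share of the 936 manually annotated frame instances. The short check below recomputes them from the counts in the table; the variable names are illustrative.

```python
# Counts copied from the table; percentages are shares of the total.
counts = {
    "ACT PAT": 250, "PAT*": 157, "ACT PAT TWHEN": 30, "ACT MANN PAT": 23,
    "ACT ADDR PAT": 20, "ACT LOC": 20, "ACT LOC PAT": 20, "MANN PAT": 17,
    "ACT CAUS PAT": 16, "ACT MANN": 13, "LOC PAT": 12, "ADDR PAT": 11,
    "Other": 347,
}

total = sum(counts.values())
percent = {frame: round(100 * n / total, 2) for frame, n in counts.items()}

print(total)
for frame, share in percent.items():
    print(f"{frame:<15}{share:>6.2f}")
```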
Results
 frames annotated with MSD, analytical functions and
tectogrammatical functors
djelovati
(djeluje Pred)
[ neozbiljno Neozbiljno Rnp Adv MANN ]
[ odustajanje odustajanje Ncnsn Sb ACT ]
osloboditi
(oslobodili Pred)
[ ACT ] [ nikada Nikada Rt Adv THL ]
[ zloduh zloduha Ncmsg Obj PAT ]
postati
(postali Pred)
[studij studiji Ncmpn Sb ACT]
[fakultet fakultet Ncmsn Obj PAT]
zaustaviti
(zaustavio Atr)
[ ACT ] [ oni ih Pp3-pa--y-n-- Obj PAT ]
[ dolina u->dolini Spsl->Ncfsl AuxP->Adv LOC ]
Results
 Distribution of (MSD, analytical function) pairs
across tectogrammatical functors
ACT (Actor)
A-fun         MSD              %
Sb            Ncmsn            14.91
Sb            Np-sn            13.50
Sb            Ncfsn            12.87
Sb            Ncmpn            9.89
Sb            Npfsn            5.65
Sb            Pi-mpn--n-a--    4.71
Sb            Ncfpn            3.30
Sb            Ncnsn            2.98
Sb            Pi-msn--n-a--    2.51
Sb            Pi-fsn--n-a--    1.88

PAT (Patient)
A-fun         MSD              %
Obj           Ncfsa            11.25
Obj           Ncmsa            9.18
Pnom          Ncmsn            5.69
Obj           Ncmpa            4.53
Obj           Vmn*             4.40
Obj           Ncnsa            3.75
Obj           Ncfpa            3.49
Pnom          Ncfsn            2.72
(AuxC) Obj    (Css) Vmip3s     2.07
Obj           Ncmsn            1.81

LOC (Locative)
A-fun         MSD              %
(AuxP) Adv    (Spsl) Ncfsl     21.88
(AuxP) Adv    (Spsl) Ncmsl     16.41
(AuxP) Adv    (Spsl) Npmsl     10.16
(AuxP) Adv    (Spsl) Ncnsl     8.59
(AuxP) Adv    (Spsl) Npfsl     8.59
(AuxP) Adv    (Spsl) Ncmpl     5.47
(AuxP) Adv    (Spsl) Ncfpl     3.91
Adv           Rl               3.13
Adv           Css              1.56
(AuxP) Adv    (Spsg) Ncmsg     1.56
 serves as a basis for defining functor assignment rules
from MSD and analytical function
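One simple way such rules could be derived is to map each (analytical function, MSD) pair to the functor it most often carries in the annotated frames. The sketch below is a hypothetical baseline and not the rule system envisaged by the authors; `induce_rules` and `assign_functor` are illustrative names, and the training triples are taken from the annotated frames shown on the earlier results slide.

```python
from collections import Counter, defaultdict


def induce_rules(annotated):
    """Map each (a-fun, MSD) pair to its most frequent functor."""
    votes = defaultdict(Counter)
    for afun, msd, functor in annotated:
        votes[(afun, msd)][functor] += 1
    return {pair: c.most_common(1)[0][0] for pair, c in votes.items()}


def assign_functor(rules, afun, msd):
    """Return the functor for a pair, or None when no rule covers it."""
    return rules.get((afun, msd))


# (a-fun, MSD, functor) triples from the annotated example frames.
annotated = [
    ("Adv", "Rnp", "MANN"),
    ("Sb", "Ncnsn", "ACT"),
    ("Adv", "Rt", "THL"),
    ("Obj", "Ncmsg", "PAT"),
    ("Sb", "Ncmpn", "ACT"),
    ("Obj", "Ncmsn", "PAT"),
    ("Obj", "Pp3-pa--y-n--", "PAT"),
    ("AuxP->Adv", "Spsl->Ncfsl", "LOC"),
]

rules = induce_rules(annotated)
print(assign_functor(rules, "Sb", "Ncnsn"))
print(assign_functor(rules, "AuxP->Adv", "Spsl->Ncfsl"))
print(assign_functor(rules, "Atr", "Agpmsn"))  # pair not seen in the sample
```

A lookup of this kind leaves unseen pairs unassigned, so a practical system would need a fallback, for example backing off from the full MSD to part of speech and case.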
Conclusions
 in this experiment we have designed and
implemented one possible approach:
 to semi-automatic extraction of a valency frame
lexicon for Croatian verbs
 to the refinement of existing lexicons by using the
Croatian Dependency Treebank as an underlying
resource
 we have automatically extracted 2930 verb valency
frame instances, manually annotated 936 of them, and
obtained two results:
 the distribution of valency frames for each of the
encountered verbs
 the distribution of analytical functions and
morphosyntactic tags for each of the
tectogrammatical functors
Future work
 the first result enables the enrichment of existing
valency lexicons, such as CROVALLEX
 the second result enables the implementation of a
rule-based system for automatic assignment of
tectogrammatical functors to morphosyntactically
tagged and dependency-parsed unseen text
 this procedure for automatic detection of valency
frames will also be used in several other projects
dealing with factored SMT (e.g. ACCURAT)
 regarding dependency parsing of Croatian using
the Croatian Dependency Treebank, we shall
pursue various research directions in order to
increase overall parsing accuracy
Thank you for your attention.
www.accurat-project.eu
The research within the project ACCURAT leading to
these results has received funding from the European
Union Seventh Framework Programme (FP7/2007-2013), grant agreement no. 248347.