Transcript Slide 1
Verb Valency Frame Extraction Using Morphological
and Syntactic Features of Croatian
Krešimir Šojat, Željko Agić, Marko Tadić
Department of Linguistics, Department of Information Sciences
Faculty of Humanities and Social Sciences, University of Zagreb
{ksojat, zagic, marko.tadic}@ffzg.hr
FASSBL 7 Conference
Dubrovnik, Croatia
2010-10-05
Overview
What?
extraction and semi-automatic construction of verb
valency frames
How?
rule-based extraction procedure run on the Croatian
dependency treebank
manual assignment of tectogrammatical functors
inference of rules for assigning functors to unseen
text
Why?
creation of treebank-based verb valency lexicon
enhancement and enrichment of existing resources
Valency frames
valency frame extraction means detecting all
possible environments of a particular verb as found in
the treebank
such an approach aims at fast construction of
valency frames
extraction is automatic, no elements of frames
added manually by human annotators
such an automatically acquired verb valency lexicon
can serve as a basis for further enrichment and
enhancement of manually constructed resources,
either existing or constructed from scratch
The treebank
Croatian Dependency Treebank (HOBS)
follows the guidelines of the Prague DT
taken from the Croatia Weekly 100 kw sub-corpus of
the Croatian National Corpus (HNK)
XCES-encoded up to the word level
sentence-delimited, tokenized, manually lemmatized
and MSD-tagged
serves as the morphological layer of the treebank
annotated on the syntactic layer
approximately 2,700 sentences, 67,000 tokens
manually assigned syntactic functions
ca. 1,300 sentences double-checked and used in this
experiment
The treebank
HR: Unija je već dogovorila neke mjere kako bi pomogla Hrvatskoj.
EN: The Union has already arranged some measures in order to help Croatia.
Extraction algorithm
the algorithm aims at extraction of verb valency
frame instances
for each verb in the treebank sample, it descends
one level down the dependency tree to retrieve
subjects (Sb), objects (Obj), adverbs (Adv) and
nominal predicates (Pnom)
it descends two levels down to retrieve the same kinds
of tokens when they are introduced by subordinating
conjunctions (AuxC) or prepositions (AuxP)
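The descent described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `Token` class, the `head` indices, the word order and the function names are assumptions made for the example, while the forms, MSD tags and analytical functions are those of the "imati" frame shown elsewhere in these slides.

```python
from dataclasses import dataclass

@dataclass
class Token:
    idx: int    # position in the sentence (hypothetical numbering)
    form: str
    msd: str    # morphosyntactic tag
    afun: str   # analytical function
    head: int   # idx of the governing token, 0 for the root

DIRECT = {"Sb", "Obj", "Adv", "Pnom"}   # retrieved one level down
BRIDGES = {"AuxC", "AuxP"}              # conjunctions/prepositions to descend through

def children(tokens, head_idx):
    return [t for t in tokens if t.head == head_idx]

def extract_frame(tokens, verb):
    """Collect frame slots one level below the verb, two levels via AuxC/AuxP."""
    frame = []
    for child in children(tokens, verb.idx):
        if child.afun in DIRECT:
            frame.append(f"{child.form} {child.msd} {child.afun}")
        elif child.afun in BRIDGES:
            for grand in children(tokens, child.idx):
                if grand.afun in DIRECT:
                    frame.append(
                        f"{child.form}->{grand.form} "
                        f"{child.msd}->{grand.msd} "
                        f"{child.afun}->{grand.afun}"
                    )
    return frame

# Toy dependency structure; head links are assumed for illustration.
tokens = [
    Token(1, "stanovništvo", "Ncnsn", "Sb", 2),
    Token(2, "imati", "Vmn", "Obj", 0),
    Token(3, "korist", "Ncfsa", "Obj", 2),
    Token(4, "od", "Spsg", "AuxP", 2),
    Token(5, "projekta", "Ncmsg", "Adv", 4),
    Token(6, "kroz", "Spsa", "AuxP", 2),
    Token(7, "ekoturizam", "Ncmsa", "Adv", 6),
]
frame = extract_frame(tokens, tokens[1])
for slot in frame:
    print(f"[{slot}]")
```

Keeping the set of descending rules this small is what the slides rely on for precision; anything deeper than two levels is simply not visited.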
Extraction algorithm
algorithm illustration
dogovorila (dogovoriti Pred)
[Unija Ncfsn Sb] [mjere Ncfpa Obj] [već Rt Adv] [kako Css AuxC]
Extraction algorithm
the first version retrieved predicates only and was
expanded to retrieve all the verbs from the treebank
sample
algorithm adapted to retrieve any verbs found in the
dependency structure, regardless of their respective
analytical functions and position within the dependency
trees
the adaptation itself is implemented in order to raise the
recall of the algorithm, while still maintaining its
precision by not changing the simple set of descending
rules
i.e. to retrieve as many verbs as possible given the
limited size of the treebank sample used in the
experiment
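The difference between the first version and the adapted one can be illustrated with two selectors. This is only a sketch: it assumes that verbs are recognised by an MSD tag starting with "V", which matches the tags shown in these slides (Vmn, Vmps-sma, Vmip3s) but is not stated by the authors. The `Token` tuple and the function names are hypothetical; the token annotations come from the naglasiti/imati example in these slides, and the list is only a partial sentence.

```python
from typing import NamedTuple

class Token(NamedTuple):
    form: str
    msd: str   # morphosyntactic tag
    afun: str  # analytical function

def is_verb(token):
    # Assumption: verb tags begin with "V".
    return token.msd.startswith("V")

def select_predicates(tokens):
    """First version: only verbs annotated as predicates (Pred)."""
    return [t for t in tokens if is_verb(t) and t.afun == "Pred"]

def select_all_verbs(tokens):
    """Adapted version: any verb, regardless of analytical function."""
    return [t for t in tokens if is_verb(t)]

SENTENCE = [
    Token("Mikuška", "Np-sn", "Sb"),
    Token("naglasio", "Vmps-sma", "Pred"),
    Token("kako", "Css", "AuxC"),
    Token("imati", "Vmn", "Obj"),
    Token("stanovništvo", "Ncnsn", "Sb"),
    Token("korist", "Ncfsa", "Obj"),
]

print([t.form for t in select_predicates(SENTENCE)])
print([t.form for t in select_all_verbs(SENTENCE)])
```

The second selector yields one additional frame candidate here (the infinitive annotated as Obj), which is the kind of recall gain the adaptation is meant to provide.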
Extraction algorithm
the verb “imati” (Vmn) is annotated as object (Obj)
Extraction algorithm
Thus, the number of frames extracted from each sentence
corresponds to the number of verbs:
one frame for the main clause that captures the
whole syntactic structure of the sentence
frames extracted from dependent clauses
naglasio (naglasiti Vmps-sma Pred)
[Mikuška Np-sn Sb] [kako->imati Css AuxC->Obj]
imati (imati Vmn Obj)
[stanovništvo Ncnsn Sb] [korist Ncfsa Obj]
[od->projekta Spsg->Ncmsg AuxP->Adv]
[kroz->ekoturizam Spsa->Ncmsa AuxP->Adv]
Functor assignment
In order to annotate verbal frames we used a set of
5 argument functors and 32 free modification
functors:
Argument functors: ACT, PAT, ADDR, ORIG, EFF
Temporal functors: TWHEN, TFHL, TFRWH, THL, THO,
TOWH, TPAR, TSIN, TTILL
Locative and directional functors: DIR1, DIR2, DIR3, LOC
Functors for causal relations: AIM, CAUS, CNCS, COND, INTT
Functors for expressing manner: ACMP, CPR, CRIT, DIFF,
EXT, MANN, MEANS, REG, RESL, RESTR
Functors for specific modifications: BEN, CONTRD, HER,
SUBS
936 frame instances were manually annotated for 424
different verbs
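The functor inventory listed above can be written down as a small lookup structure, for instance to check manually annotated frames for labels outside the inventory. This is a sketch only; the group keys and the helper name `unknown_functors` are hypothetical.

```python
ARGUMENTS = {"ACT", "PAT", "ADDR", "ORIG", "EFF"}

FREE_MODIFICATIONS = {
    "temporal": {"TWHEN", "TFHL", "TFRWH", "THL", "THO",
                 "TOWH", "TPAR", "TSIN", "TTILL"},
    "locative_directional": {"DIR1", "DIR2", "DIR3", "LOC"},
    "causal": {"AIM", "CAUS", "CNCS", "COND", "INTT"},
    "manner": {"ACMP", "CPR", "CRIT", "DIFF", "EXT",
               "MANN", "MEANS", "REG", "RESL", "RESTR"},
    "specific": {"BEN", "CONTRD", "HER", "SUBS"},
}

ALL_FUNCTORS = ARGUMENTS.union(*FREE_MODIFICATIONS.values())

def unknown_functors(frame):
    """Return the labels of a space-separated frame that are not in the inventory."""
    return [label for label in frame.split() if label not in ALL_FUNCTORS]

free_count = sum(len(group) for group in FREE_MODIFICATIONS.values())
print(len(ARGUMENTS), free_count)          # 5 argument, 32 free modification functors
print(unknown_functors("ACT ADDR PAT"))    # nothing to report
print(unknown_functors("ACT ADDDR PAT"))   # a misspelled label is reported
```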
Results
valency frame frequency across verb lemmas
Verb        Frequency
biti        188
imati       23
reći        15
dobiti      12
raditi      10
kazati      9
pokazati    8
postati     8
vidjeti     8
dati        7
raditi (en. to work, to do)
Valency frame        Frequency
ACT PAT              2
ACT CRIT LOC THL     1
ACT MANN TWHEN       1
ACT MEANS TWHEN      1
ACT PAT TSIN         1
dati (en. to give)
Valency frame        Frequency
ACT ADDR PAT         4
ACT ADDDR PAT        1
ACT ADDR AIM PAT     1
ACT PAT              1
Results
frequency of verb valency frames, i.e. n-tuples of
tectogrammatical functors
Frame            Count    Percent
ACT PAT          250      26.71
PAT*             157      16.77
ACT PAT TWHEN    30       3.21
ACT MANN PAT     23       2.46
ACT ADDR PAT     20       2.14
ACT LOC          20       2.14
ACT LOC PAT      20       2.14
MANN PAT         17       1.82
ACT CAUS PAT     16       1.71
ACT MANN         13       1.39
LOC PAT          12       1.28
ADDR PAT         11       1.18
Other            347      37.07
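A frequency table of this kind can be produced by counting frame instances as n-tuples of functors. The sketch below is an assumption about how such counting might be done, not the authors' code: it treats a frame as the alphabetically sorted set of its functor labels (the frames in the table happen to be listed that way), and the toy instances are invented for illustration. The last lines recompute two percentages from the table, using the 936 annotated instances as the total.

```python
from collections import Counter

def frame_key(functors):
    """Normalise a frame instance to a sorted, space-separated label string."""
    return " ".join(sorted(functors))

def frame_distribution(instances):
    """Return (frame, count, percent) triples, most frequent first."""
    counts = Counter(frame_key(f) for f in instances)
    total = sum(counts.values())
    return [(frame, n, round(100 * n / total, 2))
            for frame, n in counts.most_common()]

# Hypothetical toy data: four annotated frame instances.
instances = [
    ["ACT", "PAT"],
    ["PAT", "ACT"],            # same frame, different slot order
    ["ACT", "PAT", "TWHEN"],
    ["ACT", "LOC"],
]
dist = frame_distribution(instances)
for frame, n, pct in dist:
    print(f"{frame:<15} {n:>3} {pct:>6.2f}")

# Recomputing table values from the reported counts.
print(round(100 * 250 / 936, 2))   # ACT PAT
print(round(100 * 347 / 936, 2))   # Other
```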
Results
frames annotated with MSD, analytical functions and
tectogrammatical functors
djelovati
(djeluje Pred)
[ neozbiljno Neozbiljno Rnp Adv MANN ]
[ odustajanje odustajanje Ncnsn Sb ACT ]
osloboditi
(oslobodili Pred)
[ ACT ] [ nikada Nikada Rt Adv THL ]
[ zloduh zloduha Ncmsg Obj PAT ]
postati
(postali Pred)
[studij studiji Ncmpn Sb ACT]
[fakultet fakultet Ncmsn Obj PAT]
zaustaviti
(zaustavio Atr)
[ ACT ] [ oni ih Pp3-pa--y-n-- Obj PAT ]
[ dolina u->dolini Spsl->Ncfsl AuxP->Adv LOC ]
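The bracketed notation above can be read mechanically. The sketch below assumes, from the examples alone, that a full slot holds five whitespace-separated fields in the order lemma, word form, MSD, analytical function, functor, and that a slot with a single field (such as "[ ACT ]") carries only a functor. Both the field interpretation and the function name `parse_frame` are assumptions.

```python
import re

def parse_frame(text):
    """Parse bracketed frame slots into dictionaries."""
    slots = []
    for body in re.findall(r"\[([^\]]*)\]", text):
        fields = body.split()
        if len(fields) == 1:
            # Slot with a functor only, no surface realisation listed.
            slots.append({"lemma": None, "form": None, "msd": None,
                          "afun": None, "functor": fields[0]})
        elif len(fields) == 5:
            lemma, form, msd, afun, functor = fields
            slots.append({"lemma": lemma, "form": form, "msd": msd,
                          "afun": afun, "functor": functor})
        else:
            raise ValueError(f"unexpected slot: {body!r}")
    return slots

slots = parse_frame(
    "[ ACT ] [ nikada Nikada Rt Adv THL ] [ zloduh zloduha Ncmsg Obj PAT ]"
)
for slot in slots:
    print(slot["functor"], slot["lemma"], slot["msd"])
```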
Results
Distribution of (MSD, analytical function) pairs
across tectogrammatical functors
ACT (Actor)
A-fun         MSD              %
Sb            Ncmsn            14.91
Sb            Np-sn            13.50
Sb            Ncfsn            12.87
Sb            Ncmpn            9.89
Sb            Npfsn            5.65
Sb            Pi-mpn--n-a-     4.71
Sb            Ncfpn            3.30
Sb            Ncnsn            2.98
Sb            Pi-msn--n-a-     2.51
Sb            Pi-fsn--n-a--    1.88

PAT (Patient)
A-fun         MSD              %
Obj           Ncfsa            11.25
Obj           Ncmsa            9.18
Pnom          Ncmsn            5.69
Obj           Ncmpa            4.53
Obj           Vmn*             4.40
Obj           Ncnsa            3.75
Obj           Ncfpa            3.49
Pnom          Ncfsn            2.72
(AuxC) Obj    (Css) Vmip3s     2.07
Obj           Ncmsn            1.81

LOC (Locative)
A-fun         MSD              %
(AuxP) Adv    (Spsl) Ncfsl     21.88
(AuxP) Adv    (Spsl) Ncmsl     16.41
(AuxP) Adv    (Spsl) Npmsl     10.16
(AuxP) Adv    (Spsl) Ncnsl     8.59
(AuxP) Adv    (Spsl) Npfsl     8.59
(AuxP) Adv    (Spsl) Ncmpl     5.47
(AuxP) Adv    (Spsl) Ncfpl     3.91
Adv           Rl               3.13
Adv           Css              1.56
(AuxP) Adv    (Spsg) Ncmsg     1.56
serves as basis for defining functor assignment rules
from MSD and analytical function
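As an illustration of what such rules might look like, the sketch below encodes a few patterns read off the most frequent rows of the distribution: nominative noun subjects as ACT, accusative noun objects and nominative nominal predicates as PAT, and locative nouns introduced by an Spsl preposition as LOC. These toy rules are a hypothetical simplification, not the rule set inferred by the authors. The code also assumes that the case letter is the fifth character of a noun MSD, which is consistent with tags such as Ncmsn, Ncfsa and Ncfsl in these slides.

```python
def noun_case(msd):
    """Return the case letter of a noun MSD such as 'Ncfsa', else None."""
    if msd.startswith("N") and len(msd) >= 5:
        return msd[4]
    return None

def assign_functor(afun, msd, prep_msd=None):
    """Toy rules mapping (analytical function, MSD) to a functor, or None."""
    case = noun_case(msd)
    if afun == "Sb" and case == "n":
        return "ACT"
    if afun == "Obj" and case == "a":
        return "PAT"
    if afun == "Pnom" and case == "n":
        return "PAT"
    if afun == "Adv" and prep_msd == "Spsl" and case == "l":
        return "LOC"
    return None  # no rule applies; leave for manual annotation

print(assign_functor("Sb", "Ncmsn"))
print(assign_functor("Obj", "Ncfsa"))
print(assign_functor("Adv", "Ncfsl", prep_msd="Spsl"))
print(assign_functor("Adv", "Rt"))
```

Returning None instead of guessing keeps the sketch conservative: pairs not covered by a rule stay unlabelled rather than receiving a default functor.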
Conclusions
in this experiment we have designed and
implemented one possible approach:
to semi-automatic extraction of a valency frame
lexicon for Croatian verbs
to the refinement of existing lexicons by using the
Croatian Dependency Treebank as an underlying
resource
we have automatically extracted 2930 verb valency
frame instances and manually annotated 936 of them, obtaining:
the distribution of valency frames for each of the
encountered verbs
the distribution of analytical functions and
morphosyntactic tags for each of the
tectogrammatical functors
Future work
the first result enables the enrichment of existing
valency lexicons, such as CROVALLEX
the second result enables the implementation of a
rule-based system for automatic assignment of
tectogrammatical functors to morphosyntactically
tagged and dependency-parsed unseen text
this procedure of automatic detection of valency
frames will also be used in several other projects
dealing with factored SMT (e.g. ACCURAT)
regarding dependency parsing of Croatian by using
the Croatian Dependency Treebank, we shall
pursue various research directions in order to
increase overall parsing accuracy
Thank you for your attention.
www.accurat-project.eu
The research within the project ACCURAT leading to
these results has received funding from the European
Union Seventh Framework Programme (FP7/2007-2013), grant agreement no 248347.