Czech Verbs of Communication
and the Extraction of their
Frames
Václava Benešová and Ondřej Bojar
TSD, Brno, 13.9.2006
Institute of Formal and Applied
Linguistics,
{benesova,bojar}@ufal.mff.cuni.cz
Introduction
1. VALLEX, Valency Lexicon of Czech
Verbs
2. Automatic Identification of Verbs of
Communication
3. Frame Suggestion
4. Conclusion
1. Valency lexicon of Czech
Verbs, VALLEX 1.x, and its Verb
Classes
Verb Classes in VALLEX
Verbs of Communication
VALLEX
Theoretical background:
Functional Generative
Description (FGD)
Valency: “ability of lexical
units to bind other
lexical units”
Versions: 1.0, internal 1.5,
2.0 (autumn 2006)
(almost 4300 entries)
Corpus coverage (Czech National Corpus):
● about 10% of verb occurrences belong to low-frequency verbs that are not covered (approx. 28,000 lemmas)
VALLEX 1.0:
● covered: 1,064 lemmas [53.7% of occurrences]
● not covered, frequent: 20 lemmas [27.9% of occurrences]
● not covered, infrequent: 28,385 lemmas [18.3% of occurrences]
VALLEX 1.5:
● covered: 1,802 lemmas [65.6% of occurrences]
● not covered, frequent: 4 lemmas [23.4% of occurrences]
● not covered, infrequent: 27,663 lemmas [10.9% of occurrences]
Verb Entry in VALLEX
Verb entry: a set of valency frame(s)
Valency frame: a sequence of slots (functor, morphemic realization, type of complement)
Attributes of valency frames: gloss, example, … class
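The entry structure described on this slide can be pictured as a small data model. The following is only an illustrative sketch: the class and field names, and the "obligatory" label, are assumptions made here and do not reflect the actual VALLEX file format. The sample frame reuses the general communication frame ACT {nom}, ADDR {gen/dat/acc}, PAT/EFF {dc,...} from the talk rather than a real lexicon entry.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Slot:
    functor: str          # e.g. "ACT", "ADDR", "PAT"
    forms: tuple          # morphemic realizations, e.g. ("nom",)
    complement_type: str  # hypothetical label, e.g. "obligatory"


@dataclass
class ValencyFrame:
    slots: list           # ordered sequence of Slot objects
    gloss: str = ""
    example: str = ""
    verb_class: str = ""  # e.g. "communication"; empty if unclassified


@dataclass
class VerbEntry:
    lemma: str
    frames: list = field(default_factory=list)


# Illustrative entry, not copied from VALLEX.
frame = ValencyFrame(
    slots=[
        Slot("ACT", ("nom",), "obligatory"),
        Slot("ADDR", ("gen", "dat", "acc"), "obligatory"),
        Slot("PAT", ("dc",), "obligatory"),
    ],
    gloss="say",
    verb_class="communication",
)
entry = VerbEntry(lemma="říci", frames=[frame])
```

Keeping the class as an attribute of the frame, not of the entry, mirrors the slide: one verb entry may hold frames that belong to different classes.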
Verb Classes in VALLEX
                                   VALLEX 1.0       VALLEX 1.5
Total verb entries                 1,437            2,476
Total verb lemmas                  1,081            1,844
Total valency frames               4,239            7,080
Valency frames with class          1,591 [37.5%]    3,156 [44.6%]
Total classes                      16               23
Frame types in class on average    6.1              6.1
Classification:
● in progress
● built from below
● emphasis on syntactic criteria
● classes: communication, mental action, perception, psych verb, exchange, change, phase verbs, phase of action, modal verbs, motion, transport, location, …
Communication verbs in VALLEX
‘a speaker conveys information to a recipient’
Frame: ACT {nom}  ADDR {gen/dat/acc}  PAT/EFF {dc,...}
● simple information: {říci: say, informovat: inform, …} + THAT: že → verbs of announcement
● question: {ptát se: ask, …} + WHETHER, IF: zda, jestli → interrogative verbs
● commands, bans, warnings, …: {nakázat: order, zakázat: prohibit, …} + IN ORDER TO, LET: aby, ať → imperative verbs
                                VALLEX 1.0    VALLEX 1.5
verbs of announcement: že       191           276
interrogative verbs: zda        87            135
imperative verbs: aby           74            105
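The three subclasses above are distinguished by the conjunction that introduces the dependent clause. A minimal lookup sketch follows; the function and table names are hypothetical, and real data would also need to handle the homonymy of these conjunctions.

```python
# Conjunction -> subclass of communication verbs, as listed on the slide.
SUBCLASS_BY_CONJUNCTION = {
    "že": "announcement",
    "zda": "interrogative",
    "jestli": "interrogative",
    "aby": "imperative",
    "ať": "imperative",
}


def communication_subclass(conjunction):
    """Return the subclass signalled by a conjunction, or None if unknown."""
    return SUBCLASS_BY_CONJUNCTION.get(conjunction.lower())
```

For example, `communication_subclass("zda")` yields `"interrogative"`, while a conjunction outside the table yields `None`.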
2. Automatic Identification of
Verbs of Communication
Evaluation VALLEX vs. FrameNet
Automatic Identification of Verbs of Communication
Search the corpus for the pattern V+N234+subord{aby,zda,že}: a verb, a noun in case 2, 3 or 4 (genitive, dative or accusative) and a subordinate clause introduced by aby, zda or že. A verb is marked as a communication verb if enough occurrences are found.
Weak points:
1. eliminates nominal structures:
‘He said the truth about the killer.’
‘He gave her many presents.’ (verb of exchange)
2. ignores examples where a complement is not expressed on the surface layer:
‘He said that …’
3. homonymy of conjunctions že (that) and aby (in order to):
‘He has done it in order to make money…’
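The search step can be sketched as a simple counter over tagged sentences. This is only an approximation under stated assumptions: tokens are hypothetical `(lemma, pos, case)` triples with `"V"`, `"N"` and `"J"` as made-up tags for verb, noun and conjunction, the noun and the conjunction may appear anywhere after the verb, and `min_count` is an invented threshold. The actual corpus query may differ in all of these respects.

```python
from collections import Counter

CONJUNCTIONS = {"aby", "zda", "že"}
NOUN_CASES = {2, 3, 4}  # genitive, dative, accusative


def matching_verbs(sentence):
    """Yield verb lemmas followed by an N234 noun and then a conjunction."""
    for i, (lemma, pos, _case) in enumerate(sentence):
        if pos != "V":
            continue
        rest = sentence[i + 1:]
        # Position of the first noun in case 2, 3 or 4 after the verb.
        noun_at = next(
            (k for k, (_, p, c) in enumerate(rest)
             if p == "N" and c in NOUN_CASES),
            None,
        )
        if noun_at is None:
            continue
        # A subordinating conjunction must follow that noun.
        if any(p == "J" and lem in CONJUNCTIONS
               for lem, p, _ in rest[noun_at + 1:]):
            yield lemma


def communication_candidates(corpus, min_count=2):
    """Return verbs that match the pattern at least min_count times."""
    counts = Counter(v for sent in corpus for v in matching_verbs(sent))
    return {verb for verb, n in counts.items() if n >= min_count}


toy_corpus = [
    [("říci", "V", None), ("Petr", "N", 3), ("že", "J", None), ("přijít", "V", None)],
    [("říci", "V", None), ("matka", "N", 3), ("že", "J", None), ("spát", "V", None)],
    [("dát", "V", None), ("dárek", "N", 4)],
]
candidates = communication_candidates(toy_corpus)
```

On the toy corpus only říci reaches the threshold; dát has a noun complement but no subordinate clause, which illustrates how the pattern separates communication verbs from verbs of exchange.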
Evaluation against VALLEX and FrameNet
Gold standards: VALLEX 1.0, VALLEX 1.5, FrameNet 1.2
ROC curves:
TP … true positives (communication verbs according to a gold standard and above the threshold)
FP … false positives (non-communication verbs and above the given threshold)
TPR = TP / P … true positive rate (P is the total number of communication verbs)
FPR = FP / N … false positive rate (N is the total number of verbs with no sense of communication)
Results:
40-50% of communication verbs identified correctly (for both VALLEX and FrameNet)
20% falsely marked
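One point of such a ROC curve can be computed as below. The sketch assumes a hypothetical dictionary of per-verb scores (for instance pattern counts) and a gold set of communication verbs; it reports the true positive rate TP / P together with the false positive rate FP / N, the pair conventionally plotted on a ROC curve. Treating a score equal to the threshold as "above" is a choice made here.

```python
def roc_point(scores, gold_positive, threshold):
    """Return (TPR, FPR) for verbs scoring at or above the threshold."""
    selected = {verb for verb, s in scores.items() if s >= threshold}
    positives = set(gold_positive) & set(scores)  # P
    negatives = set(scores) - positives           # N
    tp = len(selected & positives)
    fp = len(selected & negatives)
    tpr = tp / len(positives) if positives else 0.0
    fpr = fp / len(negatives) if negatives else 0.0
    return tpr, fpr


# Toy scores, invented for illustration only.
toy_scores = {"říci": 10, "ptát se": 6, "dát": 7, "jít": 1}
toy_gold = {"říci", "ptát se"}
curve = [roc_point(toy_scores, toy_gold, t) for t in (1, 5, 8)]
```

Sweeping the threshold from low to high moves the point from the upper right corner (everything marked) toward the origin (nothing marked).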
3. Frame Suggestion
Frame Edit Distance and Verb
Entry Similarity
Experimental Results
Frame Edit Distance and Verb Entry Similarity
FED (frame edit distance): the number of edit operations (insert, delete, replace) necessary to convert a hypothesized frame to a correct frame
ES (entry similarity or expected saving):
$$ES = 1 - \frac{\min \mathrm{FED}(G, H)}{\mathrm{FED}(G, \emptyset) + \mathrm{FED}(H, \emptyset)}$$
G … golden verb entries of this base lemma
H … hypothesized entries
Ø … blank verb entry
ES = 0% when nothing is suggested, ES = 100% when the golden frames are suggested
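A rough sketch of how ES could be computed is given below. It rests on simplifying assumptions that are not taken from the talk: a frame is a tuple of slot labels, every insert, delete or replace of a whole slot costs 1, and the minimum in the numerator is taken over all pairings of hypothesized and golden frames, with missing frames treated as empty. The authors' FED may count operations at a finer granularity.

```python
from itertools import permutations


def frame_distance(hyp, gold):
    """Levenshtein distance over slots (insert, delete, replace cost 1)."""
    prev = list(range(len(gold) + 1))
    for i, h in enumerate(hyp, start=1):
        cur = [i]
        for j, g in enumerate(gold, start=1):
            cur.append(min(prev[j] + 1,              # delete h
                           cur[j - 1] + 1,           # insert g
                           prev[j - 1] + (h != g)))  # replace or match
        prev = cur
    return prev[-1]


def entry_distance(hyp_entry, gold_entry):
    """Minimum total frame distance over all pairings of frames."""
    n = max(len(hyp_entry), len(gold_entry))
    hyp = list(hyp_entry) + [()] * (n - len(hyp_entry))
    gold = list(gold_entry) + [()] * (n - len(gold_entry))
    return min(sum(frame_distance(h, g) for h, g in zip(hyp, perm))
               for perm in permutations(gold))


def entry_similarity(gold_entry, hyp_entry):
    """ES = 1 - FED(G, H) / (FED(G, blank) + FED(H, blank))."""
    denom = entry_distance((), gold_entry) + entry_distance(hyp_entry, ())
    if denom == 0:
        return 1.0  # both entries are blank, nothing to save or lose
    return 1 - entry_distance(hyp_entry, gold_entry) / denom


gold = [("ACT(1)", "ADDR(3)", "PAT(4)")]
hyp = [("ACT(1)", "PAT(4)")]
es = entry_similarity(gold, hyp)
```

The brute-force search over permutations is acceptable here because a verb entry holds only a handful of frames; a larger inventory would call for an assignment algorithm instead.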
Experimental Results with ES
Suggested frames                                                   ES [%]
Specific frame for verbs of communication, default for others      38.00
Baseline 1: ACT(1)                                                 26.69
Baseline 2: ACT(1) PAT(4)                                          37.55
Baseline 3: ACT(1) ADDR(3,4) PAT(4)                                35.70
Baseline 4: Two typical frames: ACT(1) PAT(4)                      39.11
Conclusion
● Automatic identification of communication verbs with the proposed pattern V+N234+subord{aby,zda,že} performs satisfactorily (40-50% true positives against VALLEX and FrameNet, 20% false positives).
● FED reveals that more lexicographic labour could be saved by suggesting more than one frame per verb, so other verb classes need attention, too.