Functional Generative Description
Tectogrammatical Representation of English in Prague Czech-English Dependency Treebank
Lucie Mladová
Silvie Cinková, Kristýna Čermáková, Anja Nedoluzhko, Jana Šindlerová, Josef Toman, Zdeněk Žabokrtský
June 6, 2007
3rd PIRE Meeting
Outline:
● Functional Generative Description
● Parallel Treebanks
● PCEDT 2.0 – Project Report
    tectogrammatical level of annotation
    valency treatment
    annotation manual for English
    interannotator agreement
Functional Generative Description
Basic approach for Prague Treebanks:
● dependency
● stratificational description of the language: from structure to function (meaning), 3 layers of annotation:
    morphological
    analytical (= surface syntax)
    tectogrammatical (= "deep" syntax, semantics)
Functional Generative Description
● Since 1995: Prague Dependency Treebank (PDT): Czech data (1.0 released LDC 2001, 2.0 – LDC 2006)
● The idea of a parallel corpus: English data and Czech data (translated): Prague Czech-English Dependency Treebank (PCEDT) (1.0 released LDC 2004)
The Idea of a Parallel, Syntactically Annotated Corpus
Build an English corpus in the same formalism as PDT (data resource: Wall Street Journal section of Penn Treebank)
Translate it into Czech
Manually annotate both parts of the corpus
Train tectogrammar-based machine translation
Phrasal vs. Dependency Tree
Mr. Payson, an art dealer and collector, sold Vincent van Gogh's "Irises" at a Sotheby's auction in November 1987 to Australian businessman Alan Bond.
Dependency Trees:
a-layer = surface syntax
t-layer = underlying syntax, semantics
It may have been painted instead by a Rubens associate.
Dependency Trees:
a-layer = surface syntax
t-layer = underlying syntax, semantics
It may have been painted instead of Rubens by a Rubens associate.
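The contrast between the two layers can be illustrated with a small sketch. It assumes that every surface token (including punctuation) gets its own a-layer node, whereas only content words get their own t-layer node and auxiliaries, prepositions and articles do not. The `Token` class and the hand-assigned `own_t_node` flags are illustrative assumptions, not the treebank's data format or its actual annotation of this sentence.

```python
from dataclasses import dataclass


@dataclass
class Token:
    form: str
    own_t_node: bool  # hand-assigned for this illustration only


# The surface sentence from the slide, tokenized by hand.
sentence = [
    Token("It", True),
    Token("may", False),
    Token("have", False),
    Token("been", False),
    Token("painted", True),
    Token("instead", True),
    Token("by", False),
    Token("a", False),
    Token("Rubens", True),
    Token("associate", True),
    Token(".", False),
]


def a_layer_nodes(tokens):
    """Surface syntax: one node per token."""
    return [t.form for t in tokens]


def t_layer_nodes(tokens):
    """Underlying syntax (simplified): only tokens flagged as having a t-node."""
    return [t.form for t in tokens if t.own_t_node]


print(len(a_layer_nodes(sentence)), "a-layer nodes")
print(t_layer_nodes(sentence))
```

A real t-tree can also contain nodes with no surface counterpart, such as the restored "instead of Rubens" above; the sketch ignores this.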
Tectogrammatical Representation (t-tree) Contains:
● syntactic dependency and coordination: edges
● semantic relations: tectogrammatical functors
    verb arguments (inner participants)
    ● semantic ACT, PAT
    ● syntactic ADDR, ORIG, EFF
    free modifications (e.g. TWHEN, LOC, DIR, MANN, CAUS, CPR, ACMP...)
    other: rhematizers, idiomatic expressions, foreign phrases...
● valency of the verbs: valency lexicon EngValLex
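As a minimal sketch of how functor-labelled edges might be represented, the following groups a verb's dependents into inner participants and free modifications, using only the functor names listed above. The `TNode` class, the `PRED` label on the root, and the functors assigned to the example sentence about Mr. Payson are hand-made assumptions for illustration, not the treebank's actual annotation.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Functor names as listed on the slide; the free-modification list is open-ended there.
INNER_PARTICIPANTS = {"ACT", "PAT", "ADDR", "ORIG", "EFF"}
FREE_MODIFICATIONS = {"TWHEN", "LOC", "DIR", "MANN", "CAUS", "CPR", "ACMP"}


@dataclass
class TNode:
    lemma: str
    functor: str  # label of the edge to the parent node
    children: List["TNode"] = field(default_factory=list)


def functor_class(functor: str) -> str:
    """Classify a functor as inner participant, free modification, or other."""
    if functor in INNER_PARTICIPANTS:
        return "inner participant"
    if functor in FREE_MODIFICATIONS:
        return "free modification"
    return "other"


def group_dependents(node: TNode) -> Dict[str, List[str]]:
    """Group the direct dependents of a node by functor class."""
    groups: Dict[str, List[str]] = {}
    for child in node.children:
        label = f"{child.lemma}:{child.functor}"
        groups.setdefault(functor_class(child.functor), []).append(label)
    return groups


# Hand-assigned, simplified tree for the "Mr. Payson ... sold ..." sentence.
sold = TNode("sell", "PRED", [
    TNode("Payson", "ACT"),
    TNode("Irises", "PAT"),
    TNode("Bond", "ADDR"),
    TNode("auction", "LOC"),
    TNode("November", "TWHEN"),
])

print(group_dependents(sold))
```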
Tectogrammatical Representation (t-tree) Contains:
● links to the lower layers
● grammatical (and textual) coreference
● topic-focus articulation
Building the PCEDT 2.0, the Current Annotation of the English Data
work with the corpus data
● input: WSJ texts (PTB), approx. 50 000 sentences (1.2 million words), automatically converted into PDT-like shape – a-layer
● automatic t-layer processing
● manual annotation running (approx. 4000 trees annotated)
● meanwhile – Czech section annotation of the t-layer launched
additional work
● conversion of the PropBank lexicon into EngValLex (verbs only)
● tools adjustment (TrEd, unified macros for both CZ and ENG annotation)
● interannotator-agreement measuring
● first version of the annotation manual (being revised)
● training of new annotators
EngValLex
● adaptation of PropBank into the format of PDT-Vallex (valency lexicon for Czech)
● manual correction
● continuous checking during the annotation
● current version contains only verbs
future work on EngValLex:
● defining surface realizations – morphosyntactic characteristics of the semantic roles
● valency of nouns and adjectives
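To make the idea of a valency frame and of checking it during annotation more concrete, here is a minimal sketch. The `Slot` and `Frame` classes, the `missing_obligatory` helper, and the frame given for "sell" are hypothetical; they do not reproduce the actual EngValLex or PDT-Vallex format or any real lexicon entry. The `surface_forms` field is left empty because surface realizations are listed above as future work.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class Slot:
    functor: str                               # e.g. ACT, PAT, ADDR
    obligatory: bool
    surface_forms: Optional[List[str]] = None  # not defined yet (future work)


@dataclass
class Frame:
    lemma: str
    slots: List[Slot]


def missing_obligatory(frame: Frame, functors_present: Set[str]) -> List[str]:
    """Return obligatory functors of the frame that the annotated verb lacks."""
    return [
        slot.functor
        for slot in frame.slots
        if slot.obligatory and slot.functor not in functors_present
    ]


# Hypothetical frame, for illustration only.
sell = Frame("sell", [
    Slot("ACT", obligatory=True),
    Slot("PAT", obligatory=True),
    Slot("ADDR", obligatory=False),
])

print(missing_obligatory(sell, {"ACT", "PAT", "ADDR"}))  # []
print(missing_obligatory(sell, {"ACT"}))                 # ['PAT']
```

A check like this only flags verbs whose annotated dependents do not match the chosen frame; deciding whether the tree or the lexicon entry needs correcting remains a manual step.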
Annotation Manual
= "Annotation of English on the tectogrammatical level: Reference book"
● based on the abbreviated version of the annotation manual for PDT (Czech)
● chapters specific to English data annotation added
● first rough version 1.0.1: April 2007
● revision in progress
● extensions planned (concurrently with the annotation)
Interannotator Agreement
● monthly checks of annotation consistency (approx. 30 trees)
● measured:
    structure: agreement in parent node
    functors
● further analysis:
    list of unpaired nodes
    statistics for diverging functors
● elimination of detected annotation divergences at annotator meetings
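The measurements listed above could be computed along the following lines. This is a sketch under stated assumptions, not the project's actual tooling: each annotation is assumed to be a mapping from a node identifier to a (parent identifier, functor) pair, nodes are paired by identical identifier, and both agreement figures are taken over the paired nodes only. The `compare` function, the node ids, and the toy data are hypothetical.

```python
from collections import Counter
from typing import Dict, Optional, Tuple

# node id -> (parent id or None for the root, functor)
Annotation = Dict[str, Tuple[Optional[str], str]]


def compare(a: Annotation, b: Annotation) -> dict:
    """Compare two annotations of the same tree."""
    paired = sorted(a.keys() & b.keys())    # nodes present in both annotations
    unpaired = sorted(a.keys() ^ b.keys())  # nodes present in only one of them
    same_parent = sum(a[n][0] == b[n][0] for n in paired)
    same_functor = sum(a[n][1] == b[n][1] for n in paired)
    # Count each (functor of A, functor of B) pair where the annotators differ.
    divergences = Counter(
        (a[n][1], b[n][1]) for n in paired if a[n][1] != b[n][1]
    )
    total = len(paired)
    return {
        "structure": same_parent / total if total else 0.0,
        "functors": same_functor / total if total else 0.0,
        "unpaired": unpaired,
        "divergences": divergences,
    }


# Toy data: two annotators, one small tree.
ann_a = {"w1": (None, "PRED"), "w2": ("w1", "ACT"),
         "w3": ("w1", "PAT"), "w4": ("w3", "LOC")}
ann_b = {"w1": (None, "PRED"), "w2": ("w1", "ACT"),
         "w3": ("w1", "EFF"), "w4": ("w1", "LOC"), "w5": ("w1", "TWHEN")}

result = compare(ann_a, ann_b)
print(result["structure"], result["functors"])  # 0.75 0.75
print(result["unpaired"])                       # ['w5']
print(result["divergences"])
```

The list of unpaired nodes and the divergence counts correspond to the "further analysis" items and are the kind of output that can be discussed at annotator meetings.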
Average Interannotator Agreement
Future goals
● annotation expansion
    500 trees/annotator/month
    increasing (or at least keeping) the interannotator agreement
    training of new annotators
● EngValLex precision
● annotation manual precision and expansion
Acknowledgements
The work on the PCEDT project is supported by the grants PIRE ČR ME838 and GA405/06/0589.
Thank you for your attention!