Functional Generative Description

Tectogrammatical Representation of English in Prague Czech-English Dependency Treebank
Lucie Mladová
Silvie Cinková, Kristýna Čermáková, Anja Nedoluzhko, Jana Šindlerová, Josef Toman, Zdeněk Žabokrtský
June 6, 2007
3rd PIRE Meeting
Outline:
● Functional Generative Description
● Parallel Treebanks
● PCEDT 2.0 – Project Report
  - tectogrammatical level of annotation
  - valency treatment
  - annotation manual for English
  - interannotator agreement
Functional Generative Description
● Basic approach for Prague Treebanks
  - dependency
  - stratificational description of the language:
● From structure to function (meaning) - 3 layers of annotation:
  - morphological
  - analytical (= surface syntax)
  - tectogrammatical (= “deep” syntax, semantics)
Functional Generative Description
● Since 1995: Prague Dependency Treebank (PDT), Czech data (1.0 released by LDC in 2001, 2.0 by LDC in 2006)
● The idea of a parallel corpus of English data and its Czech translation: Prague Czech-English Dependency Treebank (PCEDT) (1.0 released by LDC in 2004)
The Idea of a Parallel, Syntactically Annotated Corpus
Build an English corpus in the same formalism as PDT (data resource: Wall Street Journal section of Penn Treebank)
Translate it into Czech
Manually annotate both parts of the corpus
Train tectogrammar-based machine translation
Phrasal vs. Dependency Tree
Mr. Payson, an art dealer and collector, sold Vincent van Gogh's "Irises" at a Sotheby's auction in November 1987 to Australian businessman Alan Bond.
Dependency Trees:
a-layer = surface syntax
t-layer = underlying syntax, semantics
It may have been painted instead by a Rubens associate.
Dependency Trees:
a-layer = surface syntax
t-layer = underlying syntax, semantics
It may have been painted instead of Rubens by a Rubens associate.
Tectogrammatical Representation (t-tree) Contains:
● syntactic dependency and coordination: edges
● semantic relations: tectogrammatical functors
  - verb arguments (inner participants)
      ● semantic ACT, PAT
      ● syntactic ADDR, ORIG, EFF
  - free modifications (e.g. TWHEN, LOC, DIR, MANN, CAUS, CPR, ACMP...)
  - other: rhematizers, idiomatic expressions, foreign phrases...
● valency of the verbs: valency lexicon EngValLex
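As a rough illustration of how edges and functors fit together, the following minimal sketch models a t-tree node in Python. The class and field names are assumptions made for this example and do not reflect the actual PCEDT data format; the functor assignments for the "Irises" sentence, including the PRED label for the main verb, are likewise only a plausible guess, not an official annotation.

```python
from dataclasses import dataclass, field
from typing import List

# Inner participants (verb arguments) as listed on the slide.
INNER_PARTICIPANTS = {"ACT", "PAT", "ADDR", "ORIG", "EFF"}


@dataclass
class TNode:
    """Illustrative t-tree node; names are hypothetical, not the PCEDT schema."""
    t_lemma: str
    functor: str
    children: List["TNode"] = field(default_factory=list)

    def add(self, t_lemma: str, functor: str) -> "TNode":
        """Attach a dependent node; the edge is the parent-child link."""
        child = TNode(t_lemma, functor)
        self.children.append(child)
        return child

    def arguments(self) -> List["TNode"]:
        """Dependents whose functor is an inner participant."""
        return [c for c in self.children if c.functor in INNER_PARTICIPANTS]

    def non_arguments(self) -> List["TNode"]:
        """All other dependents (free modifications and other node types)."""
        return [c for c in self.children if c.functor not in INNER_PARTICIPANTS]


# Simplified, assumed analysis of "Mr. Payson ... sold ... 'Irises' ... to ... Alan Bond."
sold = TNode("sell", "PRED")
sold.add("Payson", "ACT")
sold.add("Irises", "PAT")
sold.add("Bond", "ADDR")
sold.add("auction", "LOC")
sold.add("November", "TWHEN")

print([c.functor for c in sold.arguments()])      # ['ACT', 'PAT', 'ADDR']
print([c.functor for c in sold.non_arguments()])  # ['LOC', 'TWHEN']
```

The split between `arguments()` and `non_arguments()` mirrors the distinction the slide draws between inner participants and free modifications.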
Tectogrammatical Representation (t-tree) Contains:
● links to the lower layers
● grammatical (and textual) coreference
● topic-focus articulation
Building the PCEDT 2.0, the Current Annotation of the English Data
work with the corpus data
● input: WSJ texts (PTB), approx. 50 000 sentences (1.2 million words), automatically converted into PDT-like shape – a-layer
● automatic t-layer processing
● manual annotation running (approx. 4000 trees annotated)
● meanwhile – Czech section: annotation of the t-layer launched
additional work
● conversion of the PropBank lexicon into EngValLex (verbs only)
● tools adjustment (TrEd, unified macros for both CZ and ENG annotation)
● interannotator-agreement measuring
● first version of the annotation manual (being revised)
● training of new annotators
EngValLex
● adaptation of PropBank into the format of PDT-Vallex (valency lexicon for Czech)
● manual correction
● continuous checking during the annotation
● current version contains only verbs
future work on EngValLex:
● defining surface realizations – morphosyntactic characteristics of the semantic roles
● valency of nouns and adjectives
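To make the notion of a valency frame more concrete, here is a minimal sketch of how a frame and a "continuous checking" step might look. The `Slot` and `Frame` classes, the obligatoriness flags, and the frame for "sell" are all hypothetical illustrations; they are not taken from EngValLex or PDT-Vallex.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class Slot:
    """One valency slot: a functor plus whether it is obligatory (assumed model)."""
    functor: str
    obligatory: bool


@dataclass(frozen=True)
class Frame:
    """A valency frame for one verb sense (hypothetical structure)."""
    lemma: str
    slots: Tuple[Slot, ...]

    def missing(self, present: Iterable[str]) -> List[str]:
        """Obligatory functors of the frame that are absent from `present`."""
        seen = set(present)
        return [s.functor for s in self.slots
                if s.obligatory and s.functor not in seen]


# Invented frame for illustration only.
SELL = Frame("sell", (Slot("ACT", True), Slot("PAT", True), Slot("ADDR", False)))

print(SELL.missing(["ACT", "PAT", "ADDR", "TWHEN"]))  # []
print(SELL.missing(["PAT"]))                          # ['ACT']
```

A check of this kind could flag annotated verbs whose dependents do not fill the obligatory slots of the chosen frame, which is one way a lexicon and an annotation can be kept consistent.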
Annotation Manual
= "Annotation of English on the tectogrammatical level: Reference book"
● based on the abbreviated version of the annotation manual for PDT (Czech)
● chapters specific to English data annotation added
● first rough version 1.0.1: April 2007
● revision in progress
● extensions planned (concurrently with the annotation)
Interannotator Agreement
● monthly control of the annotation consistency
  - approx. 30 trees
● measured:
  - structure: agreement in parent node
  - functors
● further analysis:
  - list of unpaired nodes
  - statistics for diverging functors
  - elimination of detected annotation divergences at annotator meetings
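The measurements listed above can be sketched as follows. This is only one plausible reading of them: the slides do not say how nodes are paired or how the scores are normalized, so the pairing by shared node id, the per-paired-node ratios, and the toy data are all assumptions.

```python
from collections import Counter
from typing import Dict, Tuple

# node id -> (parent id, functor); a simplified, assumed representation
Annotation = Dict[str, Tuple[str, str]]


def compare(a: Annotation, b: Annotation) -> dict:
    """Compare two annotations of the same tree, node by node."""
    paired = sorted(a.keys() & b.keys())    # nodes present in both annotations
    unpaired = sorted(a.keys() ^ b.keys())  # nodes present in only one of them
    same_parent = sum(a[n][0] == b[n][0] for n in paired)
    same_functor = sum(a[n][1] == b[n][1] for n in paired)
    diverging = Counter(
        (a[n][1], b[n][1]) for n in paired if a[n][1] != b[n][1]
    )
    total = len(paired)
    return {
        "parent_agreement": same_parent / total if total else 0.0,
        "functor_agreement": same_functor / total if total else 0.0,
        "unpaired": unpaired,
        "diverging_functors": diverging,
    }


# Toy data, invented for illustration.
ann_a = {"n1": ("root", "PRED"), "n2": ("n1", "PAT"),
         "n3": ("n1", "ACT"), "n4": ("n3", "MANN")}
ann_b = {"n1": ("root", "PRED"), "n2": ("n1", "PAT"),
         "n3": ("n1", "ACT"), "n4": ("n1", "CAUS"), "n5": ("n1", "LOC")}

result = compare(ann_a, ann_b)
print(result["parent_agreement"])    # 0.75
print(result["functor_agreement"])   # 0.75
print(result["unpaired"])            # ['n5']
print(result["diverging_functors"])  # Counter({('MANN', 'CAUS'): 1})
```

The list of unpaired nodes and the counts of diverging functor pairs are the kind of output that could be brought to an annotator meeting for discussion.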
Average Interannotator Agreement
Future goals
● annotation expansion
  - 500 trees/annotator/month
  - increasing (or at least keeping) the interannotator agreement
  - training of new annotators
● EngValLex precision
● annotation manual precision and expansion
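For a sense of scale, the target rate can be combined with the corpus figures reported earlier (approx. 50 000 sentences, approx. 4000 trees annotated). The sketch below is back-of-the-envelope arithmetic only: the number of annotators is not stated in the slides, and the calculation ignores double annotation for agreement checks, revisions, and the Czech side.

```python
TOTAL_SENTENCES = 50_000   # approximate corpus size from the slides
ANNOTATED_TREES = 4_000    # approximate number of trees already annotated
TREES_PER_MONTH = 500      # target rate per annotator


def months_remaining(annotators: int) -> float:
    """Months needed to finish at the target rate (rough estimate)."""
    remaining = TOTAL_SENTENCES - ANNOTATED_TREES
    return remaining / (TREES_PER_MONTH * annotators)


print(months_remaining(1))  # 92.0
print(months_remaining(4))  # 23.0 (4 annotators is a hypothetical team size)
```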
Acknowledgements
The work on the PCEDT project is supported by the grants PIRE ČR ME838 and GA405/06/0589.
Thank you for your attention!