Prague Dependency Treebank 1.0 Functional Generative Description

Functional Generative Description




- theoretical framework based on the findings of European structural linguistics, esp. of the classical Prague School
- methodological requirements of a formal description
- levels:
  - tectogrammatical (underlying) representations (TRs) with dependency based syntax
  - morphemics
  - phonemics and phonetics
- TRs (see Sgall, Hajičová and Panevová 1986, formally specified by Petkevič, also in a declarative way)
The Language Layers





Phonemic,
Morphonological,
Morphemic,
Analytical (surface syntax)
Tectogrammatical (deep syntax).
Dependency tree
My younger brother arrived there yesterday.
Linearized form, one-to-one relation:
((I)Appurt (younger)Rstr brother)Act arrive.Pret.Indic (Dir there) (Temp yesterday)
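The one-to-one relation between the tree and its linearized form can be illustrated with a minimal sketch. It assumes a simple convention read off the example above: a dependent to the left of its head is written as `(...)Functor`, a dependent to the right as `(Functor ...)`. The class and function names (`Node`, `inner`, `wrap`, `linearize`) are hypothetical and are not part of any PDT tool.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    label: str                      # lexical label with grammatemes, e.g. "arrive.Pret.Indic"
    functor: Optional[str] = None   # relation to the head; None for the root
    left: List["Node"] = field(default_factory=list)   # dependents placed before the head
    right: List["Node"] = field(default_factory=list)  # dependents placed after the head


def inner(node: Node) -> str:
    """Linearize a node with its dependents, without its own parentheses."""
    parts = [wrap(child, on_left=True) for child in node.left]
    parts.append(node.label)
    parts += [wrap(child, on_left=False) for child in node.right]
    return " ".join(parts)


def wrap(node: Node, on_left: bool) -> str:
    """Add parentheses; the functor sits on the side facing the head."""
    if on_left:
        return f"({inner(node)}){node.functor}"
    return f"({node.functor} {inner(node)})"


def linearize(root: Node) -> str:
    return inner(root)


# The example sentence: My younger brother arrived there yesterday.
brother = Node("brother", "Act", left=[Node("I", "Appurt"), Node("younger", "Rstr")])
tree = Node(
    "arrive.Pret.Indic",
    left=[brother],
    right=[Node("there", "Dir"), Node("yesterday", "Temp")],
)
print(linearize(tree))
```

Under these assumptions the printed string reproduces the linearized form given on the slide.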
Dependency Tree

- labels - lexical meanings (abstract symbols) with indices
  - functors - subscripts at parentheses oriented towards head
  - grammatemes - values of morphological categories: Tense, Modality, Number, Definiteness, etc.
- projectivity
- valency
  - arguments (inner participants) and adjuncts (circumstantials or 'free modifications')
  - obligatory and optional with a given head, deletable or not
Dependency Tree

Arguments/participants of verbs
- Actor/Bearer (underlying subject)
- Objective (Patient, underlying direct object)
- Addressee (underlying indirect object)
- Effect ('second' object: to choose so. as sth.)
- Origin (to make sth. out of sth.)

Adjuncts
- Locative, several Directional and Temporal modifications
- Condition, Means, Manner, etc.
Dependency Tree
Complementations dependent mainly on nouns

Arguments (inner participants)
- Material (Partitive): two baskets of sth.
- Identity: the river Danube; the notion of operator

Adjuncts (free modifications)
- Possession (Appurtenance): my table; Jim's brother
- Restrictive: rich man
- Descriptive: the Swedes, who are a Scandinavian nation
Dependency Tree

syntactic grammatemes
- Loc, Dir - in, on, under, between...
- Regard - with, without

operational (testable) criteria for distinguishing
- arguments from adjuncts,
- arguments from each other
- deletability (dialogue test)
Simplified valency frames



read     V   Act Addr Obj
change   V   Act Obj Orig Eff
brother  N   Appurt
man      N
glass    N   Material
full     A   Material
give     V   Act Addr Obj
obligatory complementations in blue
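The simplified frames can be held in a small lookup table. The sketch below is only an illustration: the names `Frame`, `FRAMES` and `unfilled_slots` are hypothetical, and because the colour coding of obligatory complementations is not recoverable from this transcript, the sketch does not distinguish obligatory from optional slots.

```python
from typing import Dict, Iterable, List, NamedTuple, Tuple


class Frame(NamedTuple):
    pos: str                 # word class: V, N or A
    slots: Tuple[str, ...]   # complementations listed in the frame


# Frames copied from the slide; obligatoriness is NOT encoded (lost with the colours).
FRAMES: Dict[str, Frame] = {
    "read": Frame("V", ("Act", "Addr", "Obj")),
    "change": Frame("V", ("Act", "Obj", "Orig", "Eff")),
    "brother": Frame("N", ("Appurt",)),
    "man": Frame("N", ()),
    "glass": Frame("N", ("Material",)),
    "full": Frame("A", ("Material",)),
    "give": Frame("V", ("Act", "Addr", "Obj")),
}


def unfilled_slots(lemma: str, present: Iterable[str]) -> List[str]:
    """Return the frame slots of `lemma` that are not among the functors present."""
    seen = set(present)
    return [slot for slot in FRAMES[lemma].slots if slot not in seen]


# A clause with 'give' that expresses only an Actor and an Objective.
print(unfilled_slots("give", ["Act", "Obj"]))
```

A list of unfilled slots is a natural starting point for the dialogue test: each missing slot is a candidate for a deleted complementation that may need to be restored.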
Topic-focus articulation
[Figure: dependency tree with nodes marked T or F; visible node labels include "there" and "young"]
left-to-right order of nodes together with the index T or (prototypically) F indicates the TFA of the sentence (of the TR)
- contextual boundness
  - main verb CB/NB (T/F)
  - dependents to the left/right
- communicative dynamism
  - left-right (mother, sisters, transitive)
  - partial ordering
- underlying word order
  - left-right
  - linear ordering
Topic-focus articulation
[Figure: dependency tree with T and F marking; visible node labels include "there", "yesterday" and "young"]

TFA - one of the basic aspects of underlying structures
Complex sentence
My brother, whom you know, arrived there yesterday.

a subordinated (dependent) clause (i.e. its main verb)
depends on a word contained in its governing clause
Complex sentence
Martin came there late, since he had to accompany his sick mother.

function words (synsemantic) are viewed as function
morphemes, syntactically fixed to certain lexical (autosemantic)
words - prepositions and articles to nouns, conjunctions and
auxiliaries to verbs
Complex sentence
Martin arrived late to the session, since he had to accompany his sick mother.
schematically (morphemes):
Martin arrive.ed late to the session since he have.ed to accompany
he.s sick mother.
- dot - close connection of morphemes ('semes')
- deleted items restored
- order of items - difference between 'underlying' and surface (morphemic) word order
- transductive components - Panevová, Oliva, Borota

coordination (multidimensional)
- Jim and Mary, who have two children, went to Boston.
- the linearized notation is adequate:
  ((Jim Mary)Conj ((who)Act have (Pat (two)Rstr children)))Act went (Dir Boston)

structures close to Boolean, i.e. no complex 'innate properties' specific for natural language are needed.
Prague Dependency Treebank corpus annotation

an intermediate level - 'analytical' representations
- dependency trees, not always projective
- nodes for all word tokens, even for punctuation marks
- tectogrammatical tree: coordinating conjunction as the head
Prague Dependency
Treebank 1.0
Morphological Layer
ANNOTATED CORPORA
PDT: version 1.0, 2000 (1996 - 2000); (currently) ver. 2
Penn Treebank: release 3, 1999 (1989 - 1999); PropBank (currently)
The Levels in PDT



Morphemic
Analytical
Tectogrammatical
TAG SETs
Czech - ambiguous inflective language
nový, nového, novému, novém, novým, nová, nové, novou, nových,
novým, novými, … novější, novějšího, novějšímu, novějším, …,
nejnovější, nejnovějšího, nejnovějšímu, nejnovějším, … nejnovějších,
nejnovějším, …
English - language with poor inflection
work, works, worked, working
TEXT SOURCES

PDT (...taken from Czech National Corpus)
- Lidové noviny
- Mladá Fronta Dnes
- Vesmír
- Českomoravský Profit

Penn Treebank
- '88, '89 WSJ articles
- Air Travel Information System transcripts
- Brown Corpus
- Switchboard transcripts
ANNOTATION STRATEGY Penn Treebank
TEXT
Ken Church's stochastic tagger,
Eric Brill's transformation tagger
corrections by annotator (GNU Emacs Lisp based package)
ANNOTATION STRATEGY - PDT
Automatic Morphological Analyzer (AMA)
two independent annotators; Linux, Win tools
differences resolved by third annotator
comparison with the current AMA;
manual resolution; Win tools
INTERNAL FORMAT

SGML coding, csts dtd

word/tag(|tag)*
SAMPLES
PDT:
<s id="ln95040:020-p1s1">
<f>Pokus<l>pokus<t>NNIS1-----A---
<f>o<l>o<t>RR--4---------
<f>zázrak<l>zázrak<t>NNIS4-----A---
<d>.<l>.<t>Z:------------

Penn Treebank:
The/DT envelope/NN arrives/VBZ in/IN the/DT mail/NN ./.
CONVERSION

pdt2wsj.pl:    SGML coding -> word/tag
pdt2wsjFLT.pl: SGML coding -> word/lemma/tag
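The conversion from the SGML token coding to the slash-separated formats can be sketched as follows. This is not the code of pdt2wsj.pl or pdt2wsjFLT.pl, whose internals are not shown here; it is a hypothetical Python illustration that assumes each token is coded as `<f>` (or `<d>` for punctuation), then `<l>` for the lemma and `<t>` for the tag, as in the PDT sample. The names `TOKEN_RE`, `parse_tokens`, `to_word_tag` and `to_word_lemma_tag` are assumptions.

```python
import re
from typing import List, Tuple

# One token: <f>form (or <d>punctuation), <l>lemma, <t>tag. Assumed from the sample only.
TOKEN_RE = re.compile(r"<[fd]>([^<]*)<l>([^<]*)<t>([^<]*)")

Token = Tuple[str, str, str]  # (form, lemma, tag)


def parse_tokens(sgml: str) -> List[Token]:
    """Extract (form, lemma, tag) triples from a string of SGML-coded tokens."""
    return [
        (form.strip(), lemma.strip(), tag.strip())
        for form, lemma, tag in TOKEN_RE.findall(sgml)
    ]


def to_word_tag(tokens: List[Token]) -> str:
    """Render tokens in the word/tag format."""
    return " ".join(f"{form}/{tag}" for form, _lemma, tag in tokens)


def to_word_lemma_tag(tokens: List[Token]) -> str:
    """Render tokens in the word/lemma/tag format."""
    return " ".join(f"{form}/{lemma}/{tag}" for form, lemma, tag in tokens)


SAMPLE = (
    "<f>Pokus<l>pokus<t>NNIS1-----A---"
    "<f>o<l>o<t>RR--4---------"
    "<f>zázrak<l>zázrak<t>NNIS4-----A---"
    "<d>.<l>.<t>Z:------------"
)

tokens = parse_tokens(SAMPLE)
print(to_word_tag(tokens))
print(to_word_lemma_tag(tokens))
```

A regular expression is enough for this flat sample; real SGML input with nested markup or ambiguous tags (`tag|tag`) would need a proper parser.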
DATA SIZE
                          # word tokens   # sentences
PDT 1.0                   1 730K          112K
Penn Treebank release 3   4 600K          350K
DATA SETs of MORPHOLOGICALLY
ANNOTATED DATA
for tagging only (#tokens/sentences)
- training data: 1 470K/95K
- development test data: 130K/8K
- evaluation test data: 127K/8K

for parsing (preprocessing step)
- training data: 475K/29K
- development test data: 130K/8K
- evaluation test data: 127K/8K
TOOLS

- Automatic Morphological Analyser/Generator of Czech
  - HMAnalyze.pl, HMGenerate.pl
  - Dictionary: CZE_a
  - Remote Access
- Czech Taggers
  - HMM
  - Exponential
Prague Dependency Treebank
1.0
Analytical Layer in PDT
Introduction

- Input: morphologically tagged sentences
- Graph Editor: "user-friendly" software
- Output: ATS structure
  - "surface" syntax tree structure
  - nodes labelled by the analytical functions
Analytical Functions








Pred  - Predicate if it depends on the tree root
Sb    - Subject
Obj   - Object
Adv   - Adverbial
Atv   - Complement
AtvV  - Complement, if one governor is present
Atr   - Attribute
Pnom  - Nominal predicate's nominal part, depends on the copula "to be"
AuxV  - Auxiliary verb "to be"
Coord - Coordination node
Apos  - Apposition node
AuxR  - Reflexive particle, which is neither Obj nor AuxT (passive)
AuxT  - Reflexive particle, lexically bound to the verb
Analytical Functions











AuxP - Preposition or a part of compound preposition
AuxC - Subordinate conjunction
AuxO - (Superfluously) referring particle or emotional particle
AuxZ - Rhematizer or another node acting on another constituent
AuxX - Comma, but not the main coordinating comma
AuxG - Other graphical symbols not classified as AuxK
AuxY - Other words, such as particles without a specific syntactic function, parts of lexical idioms, etc.
AuxS - Sentence holder (the only added root to the tree)
AuxK - Punctuation at the end of the sentence or direct speech or citation clause
ExD  - Ellipsis handling: function for nodes which pseudo-depend on a node on which they would not depend if there were no ellipsis
AtrAtr, AtrAdv, AdvAtr, AtrObj, ObjAtr + *_Co, *_Pa, *_Ap
Two stages (chronologically)

(A) manual "analytic" annotation (ATS)
  - training data for (B)(a)
(B)
  (a) semiautomatic procedure (Collins' parser)
  (b) manual correction of (B)(a)
Constraints and limitations

- any string has a node of its own
  - word-form, punctuation mark, etc.
  - AuxV, AuxP, AuxC, AuxX, AuxG…
- reflecting the coordination and apposition relations
  - so called third dimension of the graph in the plain tree (X_Co, X_Ap, X_Pa, where X is one of analytic functions, such as Sb, Obj, Adv, etc.)
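The X_Co, X_Ap, X_Pa convention can be decoded mechanically. The sketch below is a hypothetical helper (the names `SUFFIXES` and `split_afun` are not from any PDT tool). Reading `Co` and `Ap` as membership in a coordination or apposition follows the slide; reading `Pa` as a parenthesis is an assumption that the slide itself does not state.

```python
from typing import Optional, Tuple

# Suffix meanings; "Pa" is an assumed reading, the slide only lists the suffix.
SUFFIXES = {
    "Co": "member of a coordination",
    "Ap": "member of an apposition",
    "Pa": "part of a parenthesis (assumed)",
}


def split_afun(label: str) -> Tuple[str, Optional[str]]:
    """Split an analytical function label into (base function, suffix or None)."""
    base, sep, suffix = label.rpartition("_")
    if sep and suffix in SUFFIXES:
        return base, suffix
    return label, None


for label in ["Sb_Co", "Obj_Ap", "Adv", "ExD"]:
    print(label, "->", split_afun(label))
```

Keeping the suffix separate from the base function lets the "third dimension" be stored as its own attribute while the tree itself stays a plain tree.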
Constraints and limitations

- no missing nodes (on the surface) can be added
  - analytic function ExD is used
- relations between semi-automatic and manual procedure
  - 80% of edges are established correctly automatically
Project organization




team consisting of 5-6 annotators
handbook for ATS structure annotation
100000 sentences on ATS
tectogrammatical annotation follows
Projectivity/Nonprojectivity/Surface
Order

A(B, C)
[Figure: tree diagrams of A(B, C) for different surface orders of A, B and C]
Projectivity/Non-projectivity/Surface
Order

A(B(C))
[Figure: tree diagrams of A(B(C)) for different surface orders of A, B and C]
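První restituční zákon českého parlamentu se do sněmovních lavic může vrátit jako bumerang.
The first restitution law of the Czech parliament may return to the parliamentary benches like a boomerang.
[Figure: analytical tree of the sentence; visible labels include AuxT and Adv]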
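Whether a given surface order of a tree is projective can be tested mechanically. The sketch below uses a common textbook definition, stated here as an assumption since the slides give only diagrams: an edge between a head and a dependent is projective if every word lying between them in the surface order is a descendant of that head. The function name `is_projective` and the head-list encoding are hypothetical.

```python
from typing import List


def is_projective(heads: List[int]) -> bool:
    """heads[i] is the 1-based position of the head of word i+1; 0 marks the root."""

    def is_descendant(node: int, ancestor: int) -> bool:
        # Climb from node towards the root and look for the ancestor.
        while node != 0:
            if node == ancestor:
                return True
            node = heads[node - 1]
        return False

    for dependent in range(1, len(heads) + 1):
        head = heads[dependent - 1]
        if head == 0:
            continue
        low, high = sorted((head, dependent))
        for between in range(low + 1, high):
            if not is_descendant(between, head):
                return False
    return True


# A(B, C): B and C both depend on A. Surface order "B C A".
print(is_projective([3, 3, 0]))
# A(B(C)): C depends on B, B depends on A. Surface order "A B C".
print(is_projective([0, 1, 2]))
# A(B(C)) in surface order "C A B": A stands between C and its head B.
print(is_projective([3, 0, 2]))
```

With the flat tree A(B, C) every surface order passes the check, while the chain A(B(C)) fails it whenever A is placed between B and C.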
Prague Dependency Treebank
1.0
From the Analytical
towards
the Tectogrammatical layer
Introduction

ATS annotation
- nodes: word forms, punctuation, graphical symbols
- edges: surface relations

TGTS annotation
- nodes: autosemantic words, deletions
- edges: deep layer functions
Annotation process
Input: Czech sentence
-> Tokenization
-> Morphological tagging and lexical disambiguation
-> Syntactic parsing and analytic function assignment
-> ATS (PDT1.0)
-> Tree structure pruning
-> Attribute assignments
-> TGTS
Transition procedure

deterministic procedure operating on trees
macro language for Graph Editor (perl)
automatic changes & tools for annotators

Requirements





new attributes for tectogrammatical layer
ATS is recoverable from TGTS
automated to the highest possible degree
New attributes

- trlemma - lemma of the original node or lemma composed of joined nodes
- morphological grammatemes: gender, number, degree of comparison, tense, aspect, iterativeness, verbal modality, deontic modality, sentence modality
- position of the node
- functor, topic-focus articulation, syntactic grammateme, type of relation (dependency, coordination, apposition), phraseme, deletion, quoted word, direct speech, coreference, antecedent
Tree Structure Pruning


U toho, kdo začíná opravdu od nuly, není daňový výnos pro stát podstatný.
For those who really start from zero, the tax yield for the state is not substantial.
Tree Structure Pruning
[Figure: second tree for the same sentence; the label REG is visible]
Verbal Nodes
PRED: verbmod=CDN, deontmod=HRT
- … podnikatelé by měli mít daně …
- … entrepreneurs should have (their) taxes …
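The tectogrammatical attributes can be pictured as fields of a node record. The following is a minimal sketch, not the PDT data format: the class `TNode` and its layout are hypothetical, the attribute names `trlemma`, `verbmod`, `deontmod` and `fw` are taken from the slides, and the choice of "mít" as the trlemma of the predicate in the example is an assumption.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class TNode:
    trlemma: str                 # lemma of the node, or lemma composed of joined nodes
    functor: str                 # e.g. PRED, ACT
    tfa: Optional[str] = None    # topic-focus articulation value, if assigned
    fw: Optional[str] = None     # function word (e.g. preposition) stored on the node
    grammatemes: Dict[str, str] = field(default_factory=dict)
    children: List["TNode"] = field(default_factory=list)


# The verbal node from the example; trlemma "mít" is assumed, not given on the slide.
pred = TNode(
    trlemma="mít",
    functor="PRED",
    grammatemes={"verbmod": "CDN", "deontmod": "HRT"},
)
print(pred.functor, pred.grammatemes)
```

Storing grammatemes in a dictionary keeps the sketch open to the other categories listed for the layer (gender, number, tense, aspect and so on) without fixing their value sets.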
Attribute Assignments

- prepositions stored as fw attribute
- quoted words
  - clause in quotes -> DSP
  - one pair of quotes in the sentence -> DSPP
  - string in quotes -> QUOT
- gender, number, tense, degcmp, aspect
  - default values
Macros for Annotators

keyboard shortcuts (in Graph editor)

structure changes




hide/recover nodes
merge nodes
add new nodes
functor assignments
Manual annotation




structure checking
functors
deletions of obligatory modifications
feedback for formulating the handbook for
annotators
Prague Dependency Treebank
1.0
Tectogrammatical Layer
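[Figure: tectogrammatical tree with TFA values (T, F, C) on the nodes]
Jirka se včera opil do němoty a Honza dneska.
Word-for-word gloss: George himself yesterday drank to silence and Honza today.
(i.e. Jirka drank himself senseless yesterday, and Honza did so today.)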
Attributes of Coreferential relations
- only in MC
- attribute values
coref  - the lemma of the antecedent
corsnt - NIL: in the same sentence; PREV1 ... PREVi: position of the sentence which includes the antecedent
- grammatical coreference
antec  - the functor of the antecedent
Example
Honza slíbil přijít včas.
Honza promised to come in time.
coref:  Honza
corsnt: NIL
cornum: 1
antec:  ACT
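The corsnt values can be turned into a sentence index with a few lines of code. This is a hypothetical sketch: the function name `antecedent_sentence` is invented, and reading PREVi as "i sentences before the current one" is an assumption, since the slides only say that the value gives the position of the sentence with the antecedent.

```python
import re


def antecedent_sentence(current_index: int, corsnt: str) -> int:
    """Return the index of the sentence that holds the antecedent.

    NIL means the same sentence; PREVi is read here (by assumption)
    as the i-th sentence before the current one.
    """
    if corsnt == "NIL":
        return current_index
    match = re.fullmatch(r"PREV(\d+)", corsnt)
    if match is None:
        raise ValueError(f"unexpected corsnt value: {corsnt!r}")
    return current_index - int(match.group(1))


# The example above: antecedent 'Honza' is in the same sentence (corsnt: NIL).
print(antecedent_sentence(7, "NIL"))
print(antecedent_sentence(7, "PREV2"))
```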