Prague Dependency Treebank 1.0 Functional Generative Description
Functional Generative Description
theoretical framework based on the findings of European
structural linguistics, esp. of the classical Prague School
methodological requirements of a formal description
levels:
tectogrammatical (underlying) representations (TRs) with
dependency based syntax
morphemics
phonemics and phonetics
TRs (see Sgall, Hajičová and Panevová 1986, formally specified by
Petkevič, also in a declarative way)
The Language Layers
Phonemic,
Morphonological,
Morphemic,
Analytical (surface syntax)
Tectogrammatical (deep syntax).
Dependency tree
My younger brother arrived there yesterday.
Linearized form, one-to-one relation:
((I)Appurt (younger)Rstr brother)Act arrive.Pret.Indic (Dir there) (Temp yesterday)
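The one-to-one relation between the tree and its linearized form can be illustrated with a small sketch. The `Node` class and the `linearize` function below are hypothetical names introduced only for this illustration; the sketch assumes that a dependent standing to the left of its head is written as `(...)Functor` and a dependent to the right as `(Functor ...)`, so that the functor subscript always faces the head.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """A node of a dependency tree (illustrative structure, not a PDT format)."""
    label: str                                           # lexical label, possibly with grammatemes
    functor: str = ""                                    # relation to the head (empty for the root)
    left: List["Node"] = field(default_factory=list)     # dependents preceding the head
    right: List["Node"] = field(default_factory=list)    # dependents following the head


def linearize(node: Node) -> str:
    """Write the subtree of `node` in bracketed form, functor subscripts facing the head."""
    parts = [f"({linearize(child)}){child.functor}" for child in node.left]
    parts.append(node.label)
    parts += [f"({child.functor} {linearize(child)})" for child in node.right]
    return " ".join(parts)


# "My younger brother arrived there yesterday."
brother = Node("brother", "Act", left=[Node("I", "Appurt"), Node("younger", "Rstr")])
tree = Node("arrive.Pret.Indic", left=[brother],
            right=[Node("there", "Dir"), Node("yesterday", "Temp")])

print(linearize(tree))
```

Under these assumptions the printed string coincides with the linearized form given on the slide. Coordination would need an extra node type and is not covered by the sketch.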
Dependency Tree
labels - lexical meanings (abstract symbols) with indices
functors - subscripts at parentheses oriented towards head
grammatemes - values of morphological categories: Tense, Modality, Number, Definiteness, etc.
projectivity
valency
arguments (inner participants) and
adjuncts (circumstantials or 'free modifications')
obligatory and optional with a given head,
deletable or not
Dependency Tree
Arguments/participants of verbs
Actor/Bearer (underlying subject)
Objective (Patient, underlying direct object)
Addressee (underlying indirect object)
Effect ('second' object: to choose so. as sth.)
Origin (to make sth. out of sth.)
Adjuncts
Locative, several Directional and Temporal modifications
Condition, Means, Manner, etc.
Dependency Tree
Complementations dependent mainly on nouns
Arguments (inner participants)
Material (Partitive): two baskets of sth.
Identity: the river Danube; the notion of operator
Adjuncts (free modifications)
Possession (Appurtenance): my table; Jim's brother
Restrictive: rich man
Descriptive: the Swedes, who are a Scandinavian nation
Dependency Tree
syntactic grammatemes
Loc, Dir - in, on, under, between...
Regard - with, without
operational (testable) criteria for distinguishing arguments from adjuncts, and from each other
deletability (dialogue test)
Simplified valency frames
read V Act Addr Obj
change V Act Obj Orig Eff
brother N Appurt
man N
glass N Material
full A Material
give V Act Addr Obj
obligatory complementations in blue
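As a rough illustration, the simplified frames above can be stored as a lookup table keyed by lemma and word class. The names `FRAMES` and `split_complementations` are hypothetical, and the sketch does not record which complementations are obligatory, because the colour coding of the slide is not available in this transcript.

```python
# Simplified valency frames copied from the slide: (lemma, word class) -> functors.
FRAMES = {
    ("read", "V"): ("Act", "Addr", "Obj"),
    ("change", "V"): ("Act", "Obj", "Orig", "Eff"),
    ("brother", "N"): ("Appurt",),
    ("man", "N"): (),
    ("glass", "N"): ("Material",),
    ("full", "A"): ("Material",),
    ("give", "V"): ("Act", "Addr", "Obj"),
}


def split_complementations(lemma, word_class, functors):
    """Separate the observed functors into those listed in the frame and the rest.

    Functors not listed in the frame are treated here as free modifications.
    """
    frame = FRAMES[(lemma, word_class)]
    listed = [f for f in functors if f in frame]
    free = [f for f in functors if f not in frame]
    return listed, free


print(split_complementations("give", "V", ["Act", "Obj", "Temp"]))
```

For `give` with the dependents Act, Obj and Temp, the first two are found in the frame and Temp is returned as a free modification.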
Topic-focus articulation
contextual boundness
[Figure: dependency tree with nodes marked T; node labels include "there" and "young"]
left-to-right order of nodes together
with the index T or (prototypically) F
indicates the TFA of the sentence
(of the TR)
main verb CB/NB (T/F)
dependents to the left/right
communicative dynamism
left-right (mother, sisters,
transitive)
partial ordering
underlying word order
left-right
linear ordering
Topic-focus articulation
[Figure: dependency tree with nodes marked T (topic) or F (focus); node labels include "there", "yesterday" and "young"]
TFA - one of the basic aspects of underlying structures
Complex sentence
My brother, whom you know, arrived there yesterday.
a subordinated (dependent) clause (i.e. its main verb)
depends on a word contained in its governing clause
Complex sentence
Martin came there late, since he had to accompany his sick mother.
function words (synsemantic) are viewed as function
morphemes, syntactically fixed to certain lexical (autosemantic)
words - prepositions and articles to nouns, conjunctions and
auxiliaries to verbs
Complex sentence
Martin arrived late to the session, since he had to accompany his sick mother.
schematically (morphemes):
Martin arrive.ed late to the session since he have.ed to accompany
he.s sick mother.
dot - close connection of morphemes ('semes')
deleted items restored
order of items - difference between 'underlying' and surface
(morphemic) word order
transductive components - Panevová, Oliva, Borota
coordination (multidimensional)
Jim and Mary, who have two children, went to Boston.
the linearized notation is adequate:
((Jim Mary)Conj ((who)Act have (Pat (two)Rstr children)))Act
went (Dir Boston)
structures close to Boolean, i.e. no complex 'innate properties'
specific for natural language are needed.
Prague Dependency Treebank corpus annotation
an intermediate level - 'analytical' representations
dependency trees, not always projective
nodes for all word tokens, even for punctuation marks
tectogrammatical tree: coordinating conjunction as the head
Prague Dependency
Treebank 1.0
Morphological Layer
ANNOTATED CORPORA
PDT version 1.0, 2000
(1996 - 2000)
(currently) ver. 2
Penn Treebank, release 3, 1999
(1989 - 1999)
PropBank (currently)
The Levels in PDT
Morphemic
Analytical
Tectogrammatical
TAG SETs
Czech - ambiguous inflective language
nový, nového, novému, novém, novým, nová, nové, novou, nových,
novým, novými, … novější, novějšího, novějšímu, novějším, ….,
nejnovější, nejnovějšího, nejnovějšímu, nejnovějším….. nejnovějších,
nejnovějším, …
English - language with poor inflection
work, works, worked, working
TEXT SOURCES
PDT:
Lidové noviny
Mladá Fronta Dnes
Vesmír
Českomoravský Profit
...taken from Czech National Corpus
Penn Treebank:
'88, '89 WSJ articles
Air Travel Information System transcripts
Brown Corpus
Switchboard transcripts
ANNOTATION STRATEGY Penn Treebank
TEXT
Ken Church's stochastic tagger,
Eric Brill's transformation tagger
corrections by annotator (GNU Emacs Lisp based package)
ANNOTATION STRATEGY - PDT
Automatic Morphological Analyzer (AMA)
two independent annotators; Linux, Win tools
differences resolved by third annotator
comparison with the current AMA;
manual resolution; Win tools
INTERNAL FORMAT
SGML coding, csts dtd
word/tag(|tag)*
SAMPLES
PDT:
<s id="ln95040:020-p1s1">
<f>Pokus<l>pokus<t>NNIS1-----A---
<f>o<l>o<t>RR--4---------
<f>zázrak<l>zázrak<t>NNIS4-----A---
<d>.<l>.<t>Z:------------
Penn Treebank:
The/DT envelope/NN arrives/VBZ in/IN the/DT mail/NN ./.
CONVERSION
SGML coding
word/tag
pdt2wsj.pl
pdt2wsjFLT.pl
SGML coding
word/lemma/tag
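The idea of the conversion can be sketched in a few lines of Python. This is not the code of pdt2wsj.pl or pdt2wsjFLT.pl, whose internals are not described here; the function name `csts_to_slashed` and the `keep_lemma` switch are assumptions made for the illustration, and the regular expression covers only the `<f>`, `<d>`, `<l>` and `<t>` markup visible in the PDT sample.

```python
import re

# One token: <f> (word form) or <d> (used for the full stop in the sample),
# followed by <l> (lemma) and <t> (morphological tag).
TOKEN_RE = re.compile(r"<[fd]>([^<]*)<l>([^<]*)<t>([^<]*)")


def csts_to_slashed(line, keep_lemma=True):
    """Turn one csts-style line into word/lemma/tag (or word/tag) tokens."""
    tokens = []
    for form, lemma, tag in TOKEN_RE.findall(line):
        fields = (form, lemma, tag) if keep_lemma else (form, tag)
        tokens.append("/".join(fields))
    return " ".join(tokens)


sample = (
    "<f>Pokus<l>pokus<t>NNIS1-----A---"
    "<f>o<l>o<t>RR--4---------"
    "<f>zázrak<l>zázrak<t>NNIS4-----A---"
    "<d>.<l>.<t>Z:------------"
)

print(csts_to_slashed(sample))                    # word/lemma/tag
print(csts_to_slashed(sample, keep_lemma=False))  # word/tag
```

A real converter would also have to deal with sentence boundaries, ambiguous tags (`word/tag(|tag)*`) and any markup not shown in the sample.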
DATA SIZE
                           # word tokens   # sentences
PDT 1.0                    1 730K          112K
Penn Treebank release 3    4 600K          350K
DATA SETs of MORPHOLOGICALLY ANNOTATED DATA
(sizes given as #tokens/sentences)
for tagging only
training data: 1 470K/95K
development test data: 130K/8K
evaluation test data: 127K/8K
for parsing (preprocessing step)
training data: 475K/29K
development test data: 130K/8K
evaluation test data: 127K/8K
TOOLS
Automatic Morphological Analyser/Generator of Czech
HMAnalyze.pl, HMGenerate.pl
Dictionary: CZE_a
Remote Access
Czech Taggers
HMM
Exponential
Prague Dependency Treebank
1.0
Analytical Layer in PDT
Introduction
Input: morphologically tagged sentences
Graph Editor: "user-friendly" software
Output: ATS structure
"surface" syntax tree structure
nodes labelled by the analytical functions
Analytical Functions
Pred - Predicate if it depends on the tree root
Sb - Subject
Obj - Object
Adv - Adverbial
Atv - Complement
AtvV - Complement, if one governor is present
Atr - Attribute
Pnom - Nominal predicate's nominal part, depends on the copula "to be"
AuxV - Auxiliary verb "to be"
Coord - Coordination node
Apos - Apposition node
AuxR - Reflexive particle, which is neither Obj nor AuxT (passive)
AuxT - Reflexive particle, lexically bound to the verb
Analytical Functions
AuxP - Preposition or a part of compound preposition
AuxC - Subordinate conjunction
AuxO - (Superfluously) referring particle or emotional particle
AuxZ - Rhematizer or another node acting on another constituent
AuxX - Comma, but not the main coordinating comma
AuxG - Other graphical symbols not classified as AuxK
AuxY - Other words, such as particles without a specific syntactic function, parts of lexical idioms, etc.
AuxS - Sentence holder (the only added root to the tree)
AuxK - Punctuation at the end of the sentence or direct speech or citation clause
ExD - Ellipsis handling: function for nodes which pseudo-depend on a node on which they would not depend if there were no ellipsis
AtrAtr, AtrAdv, AdvAtr, AtrObj, ObjAtr + *_Co, *_Pa, *_Ap
Two stages (chronologically)
(A) manual "analytic" annotation (ATS)
training data for (B)(a)
(B)
(a) semiautomatic procedure (Collins' parser)
(b) manual correction of (B)(a)
Constraints and limitations
any string has a node of its own
word-form, punctuation mark, etc.
AuxV, AuxP, AuxC, AuxX, AuxG…
reflecting the coordination and apposition relations
so called third dimension of the graph in the plain tree
(X_Co, X_Ap, X_Pa, where X is one of analytic functions,
such as Sb, Obj, Adv, etc.)
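A label of the form X_Co, X_Ap or X_Pa can be separated into the analytic function and its suffix with a few lines of code. The sketch below is only an illustration: `split_afun` is a hypothetical helper, and the gloss given for `Pa` is an assumption, since the slide names only coordination and apposition.

```python
# Suffixes that encode the "third dimension" of the graph in a plain tree.
SUFFIXES = {
    "Co": "member of a coordination",
    "Ap": "member of an apposition",
    "Pa": "parenthesis (assumed reading; not expanded on the slide)",
}


def split_afun(label):
    """Split e.g. 'Sb_Co' into ('Sb', 'Co'); labels without a known suffix are returned whole."""
    base, sep, suffix = label.rpartition("_")
    if sep and suffix in SUFFIXES:
        return base, suffix
    return label, None


for label in ("Sb_Co", "Obj", "Adv_Ap"):
    print(label, "->", split_afun(label))
```

Keeping the suffix apart from the function makes it possible to compare, for example, all subjects regardless of whether they are members of a coordination.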
Constraints and limitations
no missing nodes (on the surface) can be added
analytic function ExD is used
relations between semi-automatic and manual procedure
80% of edges are established correctly by the automatic procedure
Project organization
team consisting of 5-6 annotators
handbook for ATS structure annotation
100000 sentences on ATS
tectogrammatical annotation follows
Projectivity/Non-projectivity/Surface Order
A(B, C)
[Figure: the tree A(B, C) drawn with several surface orders of the nodes A, B and C]
Projectivity/Non-projectivity/Surface Order
A(B( C ))
[Figure: the tree A(B( C )) drawn with several surface orders of the nodes A, B and C]
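Whether a given surface order makes a tree projective can be checked mechanically. The sketch below uses the common definition that an edge is projective if every word standing between the head and the dependent is dominated by the head; the function names and the parent-array encoding are choices made for this illustration, not part of the PDT tools.

```python
def descendants(parent, head):
    """Return the positions of all nodes dominated by `head` (excluding `head`)."""
    result = set()
    for node in range(len(parent)):
        current = parent[node]
        while current is not None:
            if current == head:
                result.add(node)
                break
            current = parent[current]
    return result


def is_projective(parent):
    """`parent[i]` is the surface position of the head of word i, or None for the root."""
    for dependent, head in enumerate(parent):
        if head is None:
            continue
        low, high = sorted((dependent, head))
        dominated = descendants(parent, head)
        # Every word between the head and the dependent must be below the head.
        if any(between not in dominated for between in range(low + 1, high)):
            return False
    return True


# Tree A(B(C)): A governs B, B governs C.
print(is_projective([None, 0, 1]))   # surface order A B C
print(is_projective([1, None, 0]))   # surface order B A C: A stands between B and C
# Tree A(B, C): A governs both B and C.
print(is_projective([1, None, 1]))   # surface order B A C
```

With these inputs the first and third orders are projective, while the order B A C of the tree A(B(C)) is not, because A intervenes between B and its dependent C without being dominated by B.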
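První restituční zákon českého parlamentu se do sněmovních lavic může vrátit jako bumerang.
(The first restitution law of the Czech parliament may return to the parliamentary benches like a boomerang.)
[Figure: analytical tree of the sentence; visible labels: AuxT, Adv]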
Prague Dependency Treebank
1.0
From the Analytical
towards
the Tectogrammatical layer
Introduction
ATS annotation
nodes: word forms, punctuation, graphical symbols
edges: surface relations
TGTS annotation
nodes: autosemantic words, deletions
edges: deep layer functions
Annotation process
[Figure: flowchart] Input Czech sentence -> Tokenization -> Morphological tagging and lexical disambiguation -> Syntactic parsing and analytic function assignment -> ATS (PDT1.0) -> Tree structure pruning -> Attribute assignments -> TGTS
Transition procedure
deterministic procedure operating on trees
macro language for Graph Editor (Perl)
automatic changes & tools for annotators
Requirements
new attributes for tectogrammatical layer
ATS is recoverable from TGTS
automated to the highest possible degree
New attributes
trlemma - lemma of the original node or lemma composed
of joined nodes
morphological grammatemes
gender, number, degree of comparison, tense,
aspect, iterativeness, verbal modality, deontic
modality, sentence modality
position of the node
functor, topic-focus articulation, syntactic grammateme,
type of relation (dependency, coordination, apposition),
phraseme, deletion, quoted word, direct speech,
coreference, antecedent
Tree Structure Pruning
U toho, kdo začíná opravdu od nuly, není daňový výnos pro stát
podstatný.
For those, who start actually at zero, the tax outcome for the
state is not substantial.
Tree Structure Pruning
[Figure: tree for the same sentence; visible functor label: REG]
Verbal Nodes
verbmod=CDN
deontmod=HRT
PRED
•… podnikatelé by měli mít daně …
•… entrepreneurs should have (their) taxes …
Attribute Assignments
prepositions stored as fw attribute
quoted words
clause in quotes -> DSP
one pair of quotes in the sentence -> DSPP
string in quotes -> QUOT
gender, number, tense, degcmp, aspect
default values
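The first rule, storing prepositions as the fw attribute, can be sketched as a small tree transformation. The node encoding (a dictionary with `form`, `afun` and `parent`) and the function `collapse_prepositions` are assumptions made for this illustration; the actual procedure is implemented as macros for the Graph Editor and is not reproduced here. The sketch assumes that the preposition (AuxP) governs its noun in the ATS, and the analytic functions in the example fragment are illustrative.

```python
def collapse_prepositions(nodes):
    """Remove AuxP nodes and record each removed preposition on the node below it.

    `nodes` maps a node id to {"form", "afun", "parent"}; the input is left unchanged.
    """
    result = {}
    for node_id, node in nodes.items():
        if node["afun"] == "AuxP":
            continue                      # the preposition loses its own node
        new_node = dict(node)
        parent = node["parent"]
        prepositions = []
        # Climb over any AuxP nodes directly above this node.
        while parent is not None and nodes[parent]["afun"] == "AuxP":
            prepositions.insert(0, nodes[parent]["form"])
            parent = nodes[parent]["parent"]
        new_node["parent"] = parent
        if prepositions:
            new_node["fw"] = " ".join(prepositions)
        result[node_id] = new_node
    return result


# Fragment "výnos pro stát" (the outcome for the state).
ats = {
    1: {"form": "výnos", "afun": "Sb", "parent": None},
    2: {"form": "pro", "afun": "AuxP", "parent": 1},
    3: {"form": "stát", "afun": "Atr", "parent": 2},
}

for node_id, node in collapse_prepositions(ats).items():
    print(node_id, node)
```

After the transformation the node for "stát" depends directly on "výnos" and carries `fw` set to "pro"; the preposition no longer has a node of its own.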
Macros for Annotators
keyboard shortcuts (in Graph editor)
structure changes
hide/recover nodes
merge nodes
add new nodes
functor assignments
Manual annotation
structure checking
functors
deletions of obligatory modifications
feedback for formulating the handbook for
annotators
Prague Dependency Treebank
1.0
Tectogrammatical Layer
[Figure: tectogrammatical tree with the values T, F and C attached to the nodes]
Jirka se včera opil do němoty a Honza dneska.
George himself yesterday drank to silence and Honza today.
Attributes of Coreferential relations
only in MC
attribute   values
coref       the lemma of the antecedent
corsnt      NIL - in the same sentence
            PREV1 ... PREVi - position of the sentence which includes the antecedent
grammatical coreference
antec       the functor of the antecedent
Example
Honza slíbil přijít včas.
Honza promised to come in time.
coref: Honza
corsnt: NIL
cornum: 1
antec: ACT
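The attribute values of the example can be held in a small record, as in the following sketch. The class `Coref` and the function `antecedent_sentence` are hypothetical; the sketch assumes that PREVi refers to the sentence i positions before the current one, and it stores cornum as given because its meaning is not explained on the slide.

```python
import re
from dataclasses import dataclass


@dataclass
class Coref:
    """Coreference attributes of one node (illustrative container)."""
    coref: str    # lemma of the antecedent
    corsnt: str   # "NIL" or "PREV1" ... "PREVi"
    cornum: str   # kept verbatim; not interpreted in this sketch
    antec: str    # functor of the antecedent


def antecedent_sentence(current_index, corsnt):
    """Return the index of the sentence containing the antecedent.

    Assumption: PREVi means i sentences before the current one.
    """
    if corsnt == "NIL":
        return current_index
    match = re.fullmatch(r"PREV(\d+)", corsnt)
    if match is None:
        raise ValueError(f"unexpected corsnt value: {corsnt!r}")
    return current_index - int(match.group(1))


# Values from the example "Honza slíbil přijít včas."
example = Coref(coref="Honza", corsnt="NIL", cornum="1", antec="ACT")
print(antecedent_sentence(10, example.corsnt))
print(antecedent_sentence(10, "PREV2"))
```

For the example the antecedent is looked up in the current sentence, since corsnt is NIL.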