Transcript ppt file

Prague Dependency
Treebank 1.0
CD-ROM PRESENTATION
Dec 18, 2000
Functional Generative Description
• theoretical framework based on the findings of European structural linguistics, esp. of the classical Prague School
• methodological requirements of a formal description
• levels:
    tectogrammatical (underlying) representations (TRs) with dependency-based syntax
    morphemics
    phonemics and phonetics
• TRs: see Sgall, Hajičová and Panevová (1986); formally specified by Petkevič, also in a declarative way
Dependency tree
My younger brother arrived there yesterday.
Linearized form, one-to-one relation:
((I)Appurt (younger)Rstr brother)Act arrive.Pret.Indic (Dir there) (Temp yesterday)
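The one-to-one relation between the tree and its linearized form can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Node` class and the `linearize` function are hypothetical names, and the convention used (a left dependent carries its functor after the closing parenthesis, a right dependent after the opening one) is read off the example above.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical tree node: a label, a functor, and ordered dependents."""
    label: str                      # lexical label, possibly with grammatemes
    functor: Optional[str] = None   # None for the root of the tree
    left: List["Node"] = field(default_factory=list)
    right: List["Node"] = field(default_factory=list)


def linearize(node: Node, side: Optional[str] = None) -> str:
    """Return the parenthesized form; the functor sits on the side facing the head."""
    parts = [linearize(child, "L") for child in node.left]
    parts.append(node.label)
    parts += [linearize(child, "R") for child in node.right]
    inner = " ".join(parts)
    if side == "L":   # dependent to the left of its head: functor after ")"
        return f"({inner}){node.functor}"
    if side == "R":   # dependent to the right of its head: functor after "("
        return f"({node.functor} {inner})"
    return inner      # root: no parentheses


brother = Node("brother", "Act",
               left=[Node("I", "Appurt"), Node("younger", "Rstr")])
root = Node("arrive.Pret.Indic",
            left=[brother],
            right=[Node("there", "Dir"), Node("yesterday", "Temp")])
print(linearize(root))
```

Run on the example tree, the sketch reproduces the linearized string given on the slide.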
Dependency Tree
• labels - lexical meanings (abstract symbols) with indices
    functors
    subscripts at parentheses oriented towards head
• grammatemes - values of morphological categories
    Tense, Modality, Number, Definiteness, etc.
• projectivity
• valency
    arguments (inner participants) and adjuncts (circumstantials or 'free modifications')
    obligatory and optional with a given head, deletable or not
Dependency Tree
participants (arguments) of verbs
    Actor/Bearer (underlying subject)
    Objective (Patient, underlying direct object)
    Addressee (underlying indirect object)
    Effect ('second' object: to choose so. as sth.)
    Origin (to make sth. out of sth.)
adjuncts
    Locative, several Directional and Temporal modifications
    Condition, Means, Manner, etc.
Dependency Tree
Complementations dependent mainly on nouns
inner participants
    Material (Partitive): two baskets of sth.
    Identity: the river Danube; the notion of operator
free modifications
    Possession (Appurtenance): my table; Jim's brother
    Restrictive: rich man
    Descriptive: the Swedes, who are a Scandinavian nation
Dependency Tree
syntactic grammatemes
    Loc, Dir - in, on, under, between...
    Regard - with, without
operational (testable) criteria for distinguishing arguments from adjuncts and arguments from each other
    deletability (dialogue test)
Simplified valency frames
read      V   Act Addr Obj
change    V   Act Obj Orig Eff
give      V   Act Addr Obj
brother   N   Appurt
man       N
glass     N   Material
full      A   Material
obligatory complementations shown in blue on the original slide
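The frames can be held in a small lookup table. The sketch below is illustrative only: `VALENCY_FRAMES` and `not_in_frame` are hypothetical names, and because the colour marking of obligatory complementations does not survive in this transcript, the table records only which slots a frame lists, not which of them are obligatory.

```python
# Frames copied from the slide: lemma -> (part of speech, listed slots).
# Obligatoriness is NOT encoded here; the transcript does not preserve it.
VALENCY_FRAMES = {
    "read": ("V", ["Act", "Addr", "Obj"]),
    "change": ("V", ["Act", "Obj", "Orig", "Eff"]),
    "give": ("V", ["Act", "Addr", "Obj"]),
    "brother": ("N", ["Appurt"]),
    "man": ("N", []),
    "glass": ("N", ["Material"]),
    "full": ("A", ["Material"]),
}


def not_in_frame(lemma, functors):
    """Return the functors of the dependents that the frame of `lemma` does not list.

    Free modifications (Temp, Dir, ...) are expected to show up here,
    so a non-empty result is not an error by itself.
    """
    _, slots = VALENCY_FRAMES[lemma]
    return [f for f in functors if f not in slots]


print(not_in_frame("give", ["Act", "Obj", "Temp"]))
```

A fuller model would also need the obligatory/optional distinction and the deletability criterion mentioned earlier in the presentation.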
Topic-focus articulation
• contextual boundness
    main verb CB/NB (T/F)
    dependents to the left/right
• communicative dynamism
    left-right (mother, sisters, transitive)
    partial ordering
• underlying word order
    left-right
    linear ordering
left-to-right order of nodes together with the index T or (prototypically) F indicates the TFA of the sentence (of the TR)
[Figure: tree fragment; visible labels: T, there, young]
Topic-focus articulation
[Figure: dependency tree with T and F indices; visible node labels: yesterday, there, young]
TFA - one of the basic aspects of underlying structures
Complex sentence
My brother, whom you know, arrived there yesterday.
a subordinated (dependent) clause (i.e. its main verb) depends on a word contained in its governing clause
Complex sentence
Martin came there late, since he had to accompany his sick mother.
• function words (synsemantic) are viewed as function morphemes, syntactically fixed to certain lexical (autosemantic) words - prepositions and articles to nouns, conjunctions and auxiliaries to verbs
Complex sentence
Martin arrived late to the session, since he had to accompany his sick mother.
schematically (morphemes):
Martin arrive.ed late to the session since he have.ed to accompany he.s sick mother.
dot - close connection of morphemes ('semes')
• deleted items restored
• order of items - difference between 'underlying' and surface (morphemic) word order
    transductive components - Panevová, Oliva, Borota
• coordination (multidimensional)
    Jim and Mary, who have two children, went to Boston.
    the linearized notation is adequate:
    ((Jim Mary)Conj ((who)Act have (Pat (two)Rstr children)))Act went (Dir Boston)
• structures close to Boolean, i.e. no complex 'innate properties' specific for natural language are needed.
Prague Dependency Treebank - corpus annotation
an intermediate level - 'analytical' representations
dependency trees, not always projective
nodes for all word tokens, even for punctuation marks
tectogrammatical tree: coordinating conjunction as the head
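Since the analytical trees are not always projective, a simple check is useful when processing them. The following is a minimal sketch, assuming a tree is given as a mapping from 1-based word positions to head positions with 0 for the root; the function name and the data layout are hypothetical, not the PDT file format.

```python
def is_projective(heads):
    """heads maps each word position (1-based) to its head position; 0 marks the root.

    A tree is projective if, for every edge, all words lying between the
    dependent and its head are dominated by that head. The input is assumed
    to be a well-formed tree (no cycles).
    """
    def dominated_by(node, ancestor):
        while node != 0:
            if node == ancestor:
                return True
            node = heads[node]
        return False

    for dep, head in heads.items():
        if head == 0:
            continue
        low, high = sorted((dep, head))
        for between in range(low + 1, high):
            if not dominated_by(between, head):
                return False
    return True


print(is_projective({1: 2, 2: 0, 3: 2}))   # word 2 is the root, 1 and 3 depend on it
print(is_projective({1: 3, 2: 0, 3: 2}))   # edge 3 -> 1 crosses over the root
```

In the second tree the edge from word 3 to word 1 spans the root, which word 3 does not dominate, so the tree is non-projective.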
Morphological Layer
ACKNOWLEDGEMENTS
ANNOTATED CORPORA
PDT version 1.0, 2000 (1996 - 2000)
Penn Treebank, release 3, 1999 (1989 - 1999)
TAG SETs
Czech - ambiguous inflective language
nový, nového, novému, novém, novým, nová, nové, novou, nových, novým, novými, … novější, novějšího, novějšímu, novějším, … nejnovější, nejnovějšího, nejnovějšímu, nejnovějším, … nejnovějších, nejnovějším, …
English - language with poor inflection
work, works, worked, working
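The Czech tags that appear in this presentation's samples (for example NNIS1-----A---) are positional: each character encodes one morphological category. The sketch below decodes such a tag. The list of position names is an assumption taken from general PDT documentation rather than from these slides, and `decode_tag` is a hypothetical helper; the tags in this transcript have 14 characters, so the last listed position may simply be absent.

```python
# Assumed position names of the PDT positional tag (not stated on the slides).
POSITIONS = [
    "POS", "SubPOS", "Gender", "Number", "Case",
    "PossGender", "PossNumber", "Person", "Tense", "Grade",
    "Negation", "Voice", "Reserve1", "Reserve2", "Var",
]


def decode_tag(tag):
    """Map each non-empty position of a positional tag to its category name."""
    return {name: value for name, value in zip(POSITIONS, tag) if value != "-"}


print(decode_tag("NNIS1-----A---"))
print(decode_tag("RR--4---------"))
```

The contrast with a flat tag set such as DT, NN, VBZ is that each Czech tag can be taken apart into category/value pairs, which matters for a language with this many inflected forms.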
TEXT SOURCES
PDT (...taken from Czech National Corpus)
    Lidové noviny
    Mladá Fronta Dnes
    Vesmír
    Českomoravský Profit
Penn Treebank
    '88, '89 WSJ articles
    Air Travel Information System transcripts
    Brown Corpus
    Switchboard transcripts
ANNOTATION STRATEGY - Penn Treebank
TEXT
Ken Church's stochastic tagger, Eric Brill's transformation tagger
corrections by annotator (GNU Emacs Lisp based package)
ANNOTATION STRATEGY - PDT
Automatic Morphological Analyzer (AMA)
two independent annotators; Linux, Win tools
differences resolved by third annotator
comparison with the current AMA; manual resolution; Win tools
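The double-annotation step can be illustrated as follows: two annotators tag the same tokens independently, and only the positions where they differ are passed to the third annotator. This is a sketch with hypothetical names and a made-up disagreement, not the Linux or Windows tools actually used in the project.

```python
def find_disagreements(annotator_a, annotator_b):
    """Each argument is a list of (form, tag) pairs covering the same tokens.

    Returns (position, form, tag_a, tag_b) for every token the two
    annotators tagged differently.
    """
    if len(annotator_a) != len(annotator_b):
        raise ValueError("annotations must cover the same tokens")
    return [
        (i, form_a, tag_a, tag_b)
        for i, ((form_a, tag_a), (_, tag_b)) in enumerate(zip(annotator_a, annotator_b))
        if tag_a != tag_b
    ]


# Illustrative data: the second annotator picks nominative for "zázrak".
first = [("Pokus", "NNIS1-----A---"), ("o", "RR--4---------"), ("zázrak", "NNIS4-----A---")]
second = [("Pokus", "NNIS1-----A---"), ("o", "RR--4---------"), ("zázrak", "NNIS1-----A---")]
print(find_disagreements(first, second))
```

Only the third token is reported, so the adjudicator sees one case instead of the whole sentence.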
INTERNAL FORMAT
• SGML coding, csts dtd
• word/tag(|tag)*
SAMPLES
<s id="ln95040:020-p1s1">
<f>Pokus<l>pokus<t>NNIS1-----A---
<f>o<l>o<t>RR--4---------
<f>zázrak<l>zázrak<t>NNIS4-----A---
<d>.<l>.<t>Z:------------
The/DT envelope/NN arrives/VBZ in/IN the/DT mail/NN ./.
CONVERSION
SGML coding -> pdt2wsj.pl -> word/tag
SGML coding -> pdt2wsjFLT.pl -> word/lemma/tag
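The idea of the conversion can be sketched in Python. This is not the code of pdt2wsj.pl or pdt2wsjFLT.pl, whose internals are not shown in the presentation; it is a minimal illustration that assumes each token is written as a form marked by `<f>` or `<d>`, followed by `<l>` lemma and `<t>` tag, as in the sample from this presentation. All function names are hypothetical.

```python
import re

# One token: <f> (word) or <d> (punctuation), then <l>lemma and <t>tag.
TOKEN = re.compile(r"<[fd]>([^<]*)<l>([^<]*)<t>([^<]*)")


def parse_csts_line(line):
    """Return a list of (form, lemma, tag) triples found in the line."""
    return TOKEN.findall(line)


def to_word_tag(triples):
    """Penn-style word/tag output."""
    return " ".join(f"{form}/{tag}" for form, _, tag in triples)


def to_word_lemma_tag(triples):
    """word/lemma/tag output."""
    return " ".join(f"{form}/{lemma}/{tag}" for form, lemma, tag in triples)


sample = ("<f>Pokus<l>pokus<t>NNIS1-----A---<f>o<l>o<t>RR--4---------"
          "<f>zázrak<l>zázrak<t>NNIS4-----A---<d>.<l>.<t>Z:------------")
triples = parse_csts_line(sample)
print(to_word_tag(triples))
print(to_word_lemma_tag(triples))
```

The sketch ignores everything else the csts markup may contain and does not escape forms that themselves include a slash.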
DATA SIZE
                            # word tokens   # sentences
PDT 1.0                     1 730K          112K
Penn Treebank release 3     4 600K          350K
DATA SETs of MORPHOLOGICALLY ANNOTATED DATA
                                      #tokens/sentences
for tagging only
    training data                     1 470K/95K
    development test data             130K/8K
    evaluation test data              127K/8K
for parsing (preprocessing step)
    training data                     475K/29K
    development test data             130K/8K
    evaluation test data              127K/8K
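As a quick consistency check, the three tagging-only sets can be summed and compared with the overall size of PDT 1.0. The snippet below only does arithmetic on the figures quoted in this presentation (in thousands); the variable names are hypothetical.

```python
# (tokens, sentences) in thousands, as quoted for the tagging-only data sets.
TAGGING_SPLITS = {
    "training": (1470, 95),
    "development test": (130, 8),
    "evaluation test": (127, 8),
}

total_tokens = sum(tokens for tokens, _ in TAGGING_SPLITS.values())
total_sentences = sum(sentences for _, sentences in TAGGING_SPLITS.values())

print(f"total: {total_tokens}K tokens, {total_sentences}K sentences")
for name, (tokens, _) in TAGGING_SPLITS.items():
    print(f"{name}: {tokens / total_tokens:.1%} of tokens")
```

The sums come to 1 727K tokens and 111K sentences, which agrees with the rounded totals of 1 730K and 112K given for PDT 1.0.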
TOOLS
Automatic Morphological Analyser/Generator of Czech
    HMAnalyze.pl, HMGenerate.pl
    Dictionary: CZE_a
    Remote Access
Czech Taggers
    HMM
    Exponential
Analytical Layer in PDT
Introduction
Input: morphologically tagged sentences
Graph Editor: "user-friendly" software
Output: ATS structure
    "surface" syntax tree structure
    nodes labelled by the analytical functions
Two stages (chronologically)
(A) manual "analytic" annotation (ATS)
    training data for (B)(a)
(B)
    (a) semiautomatic procedure (Collins' parser)
    (b) manual correction of (B)(a)
Constraints and limitations
• any string has a node of its own
    word-form, punctuation mark, etc.
    AuxV, AuxP, AuxC, AuxX, AuxG…
• reflecting the coordination and apposition relations
    so-called third dimension of the graph in the plain tree (X_Co, X_Ap, X_Pa, where X is one of the analytic functions, such as Sb, Obj, Adv, etc.)
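The suffix convention can be unpacked mechanically. The sketch below splits a label such as Sb_Co into the analytic function and the "third dimension" marker; `split_afun` is a hypothetical helper and the suffix list contains only the three suffixes named above.

```python
SUFFIXES = ("Co", "Ap", "Pa")   # the three markers named on the slide


def split_afun(afun):
    """Split e.g. 'Sb_Co' into ('Sb', 'Co'); labels without a marker get None."""
    base, sep, suffix = afun.rpartition("_")
    if sep and suffix in SUFFIXES:
        return base, suffix
    return afun, None


for label in ("Sb_Co", "Obj_Ap", "Adv"):
    print(label, split_afun(label))
```

Labels whose part after the underscore is not one of the three markers are returned unchanged, so other analytic functions containing an underscore are not split by mistake.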
Constraints and limitations
• no missing nodes (on the surface) can be added
    analytic function Ex_D is used
• relations between semi-automatic and manual procedure
    80% of edges are established correctly automatically
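A figure such as "80% of edges correct" can be computed by comparing the head chosen by the parser with the head in the manually corrected tree, token by token. The following sketch assumes each tree is given as a list of head positions in word order; the function name and the sample data are hypothetical and the numbers are not taken from PDT.

```python
def edge_accuracy(auto_heads, gold_heads):
    """Share of tokens whose automatically assigned head matches the manual one."""
    if len(auto_heads) != len(gold_heads) or not gold_heads:
        raise ValueError("both trees must cover the same non-empty token sequence")
    correct = sum(a == g for a, g in zip(auto_heads, gold_heads))
    return correct / len(gold_heads)


# Illustrative five-token sentence: the parser gets one head wrong.
print(edge_accuracy([2, 0, 2, 3, 3], [2, 0, 2, 3, 4]))
```

This measures edges only; whether the analytic function on each node is also correct would need a separate comparison.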
Project organization
team consisting of 5-6 annotators
handbook for ATS structure annotation
1999: 100000 sentences on ATS
tectogrammatical annotation follows
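První restituční zákon českého parlamentu se do sněmovních lavic může vrátit jako bumerang.
The first restitution law of the Czech parliament may return to the benches of the Chamber like a boomerang.
[Figure: analytical tree of the sentence; visible labels: AuxT, Adv]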
From the Analytical towards the Tectogrammatical layer
Introduction
ATS annotation
    nodes: word forms, punctuation, graphical symbols
    edges: surface relations
TGTS annotation
    nodes: autosemantic words, deletions
    edges: deep layer functions
Annotation process
Input: Czech sentence
-> Tokenization
-> Morphological tagging and lexical disambiguation
-> Syntactic parsing and analytic function assignment
-> ATS (PDT1.0)
-> Tree structure pruning
-> Attribute assignments
TGTS
Transition procedure
deterministic procedure operating on trees
macro language for Graph Editor (C++ like)
automatic changes & tools for annotators
Requirements
new attributes for tectogrammatical layer
ATS is recoverable from TGTS
automated to the highest possible degree
New attributes
trlemma - lemma of the original node or lemma composed of joined nodes
morphological grammatemes
    gender, number, degree of comparison, tense, aspect, iterativeness, verbal modality, deontic modality, sentence modality
position of the node
functor, topic-focus articulation, syntactic grammateme, type of relation (dependency, coordination, apposition), phraseme, deletion, quoted word, direct speech, coreference, antecedent
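One way to picture the new attributes is as fields of a node record. The sketch below is an assumption-laden illustration, not the PDT data format: the field names trlemma, degcmp, verbmod, deontmod, fw, coref, corsnt and antec occur in this presentation, while `TectoNode` and the field name `functor` are hypothetical, and several attributes from the list are left out for brevity.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TectoNode:
    """Hypothetical record for a tectogrammatical node (subset of attributes)."""
    trlemma: str                      # lemma of the node, or of joined nodes
    functor: Optional[str] = None     # e.g. PRED, ACT
    gender: Optional[str] = None
    number: Optional[str] = None
    degcmp: Optional[str] = None      # degree of comparison
    tense: Optional[str] = None
    aspect: Optional[str] = None
    verbmod: Optional[str] = None     # verbal modality
    deontmod: Optional[str] = None    # deontic modality
    fw: Optional[str] = None          # function word, e.g. a preposition
    coref: Optional[str] = None       # lemma of the antecedent
    corsnt: Optional[str] = None      # NIL or PREVi
    antec: Optional[str] = None       # functor of the antecedent


# Values as on the Verbal Nodes slide; the trlemma "mít" is an assumption
# based on the clause "podnikatelé by měli mít daně".
node = TectoNode(trlemma="mít", functor="PRED", verbmod="CDN", deontmod="HRT")
print(node)
```

Attributes that were not assigned stay `None`, which roughly mirrors the idea of default values being filled in later.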
Tree Structure Pruning
U toho, kdo začíná opravdu od nuly, není daňový výnos pro stát podstatný.
For those who actually start from zero, the tax yield is not substantial for the state.
Tree Structure Pruning
[Figure: tree for the same sentence; visible label: REG]
Verbal Nodes
PRED
verbmod=CDN
deontmod=HRT
• … podnikatelé by měli mít daně …
• … entrepreneurs should have (their) taxes …
Attribute Assignments
prepositions stored as fw attribute
quoted words
clause in quotes -> DSP
one pair of quotes in the sentence -> DSPP
string in quotes -> QUOT
gender, number, tense, degcmp, aspect
default values
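The first rule, prepositions stored as the fw attribute, can be sketched as a small tree transformation: a preposition node (analytic function AuxP) is removed, its dependents are reattached to the preposition's head, and the preposition is remembered in `fw`. The data layout, the function name and the analytic functions in the toy example are assumptions for illustration, not the Graph Editor macros or actual PDT annotation.

```python
def prune_prepositions(nodes):
    """Drop AuxP nodes, reattach their dependents, keep the preposition as `fw`.

    nodes: list of dicts with keys id, form, afun, head (0 marks the root).
    """
    by_id = {n["id"]: n for n in nodes}
    pruned = []
    for node in nodes:
        if node["afun"] == "AuxP":
            continue                       # the preposition loses its own node
        new = dict(node, fw=None)
        parent = by_id.get(new["head"])
        while parent is not None and parent["afun"] == "AuxP":
            if new["fw"] is None:
                new["fw"] = parent["form"]  # remember the nearest preposition
            new["head"] = parent["head"]    # climb past the removed node
            parent = by_id.get(parent["head"])
        pruned.append(new)
    return pruned


# Toy fragment "do sněmovních lavic ... vrátit"; the labels are illustrative.
fragment = [
    {"id": 1, "form": "do", "afun": "AuxP", "head": 4},
    {"id": 2, "form": "sněmovních", "afun": "Atr", "head": 3},
    {"id": 3, "form": "lavic", "afun": "Adv", "head": 1},
    {"id": 4, "form": "vrátit", "afun": "Obj", "head": 0},
]
for n in prune_prepositions(fragment):
    print(n["form"], n["head"], n["fw"])
```

After pruning, "lavic" hangs directly on "vrátit" and carries fw="do", so the preposition can still be recovered from the pruned tree.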
Macros for Annotators
keyboard shortcuts (in Graph editor)
structure changes
hide/recover nodes
merge nodes
add new nodes
functor assignments
Manual annotation
structure checking
functors
deletions of obligatory modifications
feedback for formulating the handbook for annotators
Tectogrammatical Layer
[Figure: tree with topic-focus labels on the nodes; visible labels: F, T, C]
Jirka se včera opil do němoty a Honza dneska.
George himself yesterday drank to silence and Honza today. (word-for-word gloss)
Attributes of Coreferential relations
grammatical coreference
only in MC
attribute   values
coref       the lemma of the antecedent
corsnt      NIL - in the same sentence
            PREV1 ... PREVi - position of the sentence which includes the antecedent
antec       the functor of the antecedent
Example
Honza slíbil přijít včas.
Honza promised to come in time.
coref:  Honza
corsnt: NIL
cornum: 1
antec:  ACT
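The corsnt value can be turned into a sentence index when looking up an antecedent. The sketch below assumes that PREVi means "i sentences before the current one"; the presentation only says that the value gives the position of the sentence containing the antecedent, so this reading and the function name are assumptions.

```python
import re


def antecedent_sentence(current_index, corsnt):
    """Return the index of the sentence that holds the antecedent.

    NIL   -> the current sentence
    PREVi -> assumed here to mean i sentences back
    """
    if corsnt == "NIL":
        return current_index
    match = re.fullmatch(r"PREV(\d+)", corsnt)
    if match is None:
        raise ValueError(f"unexpected corsnt value: {corsnt!r}")
    return current_index - int(match.group(1))


print(antecedent_sentence(5, "NIL"))
print(antecedent_sentence(5, "PREV2"))
```

In the example with corsnt NIL, the antecedent "Honza" is therefore looked up in the same sentence, and antec: ACT identifies it there by its functor.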