Treebanks and MWEs
(Part 1)
Jan Hajič, Pavel Straňák, Jiří Mírovský
Institute of Formal and Applied Linguistics & LINDAT/CLARIN
School of Computer Science
Faculty of Mathematics and Physics
Charles University in Prague
Czech Republic
Outline
• Treebanks
– Phrase-(Constituency-) based: The Penn Treebank
– Dependency: The Prague Dependency Treebanks
• The Penn Treebank (basics)
• The Prague Dependency Treebank
– Layers of Annotation
• Morphology
• Syntax
• Semantics
• Valency
19.1.2015
PARSEME Training School Prague
THE PENN TREEBANK
Phrase- vs. Dependency-Based Treebanks
• The original: The Penn Treebank
– Phrase-based style; good for parsing by CFG grammars
• Followers
– Almost all Penn-based treebanks
• Chinese, Arabic, Korean, …
– Negra (German), many others
• Now: dependency parsing prevails
• Conversion from phrase-based treebanks
– Might lose information, heads added “ad hoc”
• “native” dependency treebanks: annotated as such
– Considered “better”
– Hindi/Urdu, TIGER (sort of); both styles manually annotated
– PDT (of course) and similar ones
» PDT style treebanks: Danish, Croatian, Slovene, Greek, Latin
The Penn Treebank
• Published (first) in 1993, now LDC99T42 (www.ldc.upenn.edu)
– First the Wall Street Journal part (1 mil. words, 2312 documents)
• Added other text types
– ATIS corpus (dialogs, travel reservations)
– Brown corpus annotated for syntax
– Switchboard (spoken language, tel. conversations)
Penn Treebank Format
Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29.
( (S
    (NP-SBJ
      (NP (NNP Pierre) (NNP Vinken) )
      (, ,)
      (ADJP
        (NP (CD 61) (NNS years) )
        (JJ old) )
      (, ,) )
    (VP (MD will)
      (VP (VB join)
        (NP (DT the) (NN board) )
        (PP-CLR (IN as)
          (NP (DT a) (JJ nonexecutive) (NN director) ))
        (NP-TMP (NNP Nov.) (CD 29) )))
    (. .) ))
• “Preterminal”: POS tag, e.g. NNS (noun, plural)
• Phrase label, e.g. NP (Noun Phrase)
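The bracketed format is a plain nested-parenthesis notation, so it can be read with a few lines of code. The following is a minimal sketch, not an official reader: the function names `parse_ptb` and `preterminals` are made up for this illustration, and empty elements, traces and comments that occur in the real Penn Treebank files get no special treatment.

```python
import re

SAMPLE = """
( (S
    (NP-SBJ
      (NP (NNP Pierre) (NNP Vinken) )
      (, ,)
      (ADJP
        (NP (CD 61) (NNS years) )
        (JJ old) )
      (, ,) )
    (VP (MD will)
      (VP (VB join)
        (NP (DT the) (NN board) )
        (PP-CLR (IN as)
          (NP (DT a) (JJ nonexecutive) (NN director) ))
        (NP-TMP (NNP Nov.) (CD 29) )))
    (. .) ))
"""

def parse_ptb(text):
    """Parse one bracketed tree into nested (label, children) tuples."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def read():
        nonlocal pos
        assert tokens[pos] == "("
        pos += 1
        label = ""  # the outermost wrapper "( (S ...))" has no label
        if tokens[pos] not in ("(", ")"):
            label = tokens[pos]
            pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())
            else:
                children.append(tokens[pos])  # a word (leaf)
                pos += 1
        pos += 1  # consume ")"
        return (label, children)

    return read()

def preterminals(tree):
    """Yield (POS tag, word) pairs in sentence order."""
    label, children = tree
    for child in children:
        if isinstance(child, str):
            yield (label, child)
        else:
            yield from preterminals(child)

tree = parse_ptb(SAMPLE)
print(" ".join(word for _, word in preterminals(tree)))
```

Each preterminal pairs a POS tag with one word; every node above it carries a phrase label, optionally with a function suffix such as -SBJ or -TMP.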
The Penn Treebank(s)
• Extensions
– Annotation of named entities, co-reference (BBN)
– cf. also previous slides
– Function labels (SBJ, OBJ, TMP, ...)
– PropBank
• Penn Treebank syntax + predicate-argument relations, added “frame files” (predicate dictionary)
(S (NP-SBJ (PRP I Arg0)) (VP (VBD gave Pred) (NP-IOBJ (PRP him Arg2)) (NP-DOBJ (DET the) (NN book Arg1))) ... )
– NomBank
• Like PropBank, but for nouns and their “arguments”
• Other languages (Chinese, Arabic, ...)
THE PRAGUE DEPENDENCY TREEBANK
The Prague Dependency Treebanks: the Basics
• Original Treebank: PDT 1.0, 2001 (morph., dep. syntax)
• First full release: PDT 2.0
– http://ufal.mff.cuni.cz/pdt2.0
• LDC2006T01, see http://www.ldc.upenn.edu
– Now: PDT 3.0: http://ufal.mff.cuni.cz/pdt3.0
• Basic general features
– Multilayered annotation, interlinked layers
– Dependency-based syntax (both surface and deep)
– Information structure of the sentence (topic/focus)
– Grammatical and basic textual coreference
– New: discourse relations, MWEs
• Languages: Czech, English (also parallel), Arabic
– Student work on “samples”: Indonesian, Urdu, Russian, …
– Spoken: work started on Czech and English (non-parallel, dialogs)
The Prague Dependency Treebank
• Three basic layers of annotation
– Morphological layer
– Surface syntax (“analytical”) layer
– “Tectogrammatical” layer: underlying syntax, semantic roles (valency), information structure, coreference
• Size
– 830,000 words (tokens) ≈ 50,000 sentences in 3,165 full documents (texts)
• Format
– Prague Markup Language (XML-based)
– Now also: .treex format
• For smooth use in the Treex platform
• http://ufal.mff.cuni.cz/treex
PDT (Czech) Data
• 4 sources:
– Lidové noviny (daily newspaper, incl. extra sections)
– DNES (Mladá fronta Dnes) (daily newspaper)
– Vesmír (popular science magazine, monthly)
– Českomoravský Profit (economic journal, weekly)
• Full articles selected
– article ~ DOCUMENT (basic corpus unit)
• Time period: 1990-1995
• 1.8 million tokens (~110,000 sentences total)
PDT Annotation Layers
• L0 (w) Words (tokens)
– automatic segmentation and markup only
• L1 (m) Morphology
– Tag (full morphology, 13 categories), lemma
• L2 (a) Analytical layer (surface syntax)
– Dependency, analytical dependency function
• L3 (t) Tectogrammatical layer (“deep” syntax)
– Dependency, “functor”, grammatemes, ellipsis resolution, coreference, topic/focus (deep word order), valency lexicon; PDT 3.0: mass, clauses, formemes, discourse, ...
[Figure labels: PDT 1.0 (2001), PDT 2.0 (2006)]
Morphological Attributes
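Tag: 13 categories (15 positions, incl. 2 reserved)
Ex.: nejnezajímavějším “(to) the most uninteresting”
Example: AAFP3----3N----
Position  Value  Meaning
1         A      Adjective
2         A      Regular
3         F      Feminine
4         P      Plural
5         3      Dative
6         -      no poss. Gender
7         -      no poss. Number
8         -      no person
9         -      no tense
10        3      superlative
11        N      negated
12        -      no voice
13        -      reserve1
14        -      reserve2
15        -      base var.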
Lemma: POS-unique identifier
Books/verb -> book-1, went -> go, to/prep. -> to-1
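Because the tag is positional, decoding it is a matter of reading one character per position. The sketch below assumes the 15-position layout listed on this slide (13 categories plus two reserved positions); the position names in `POSITIONS` and the function `decode_tag` are illustrative choices, and the value codes are left uninterpreted.

```python
# Position names follow the order of the categories listed on the slide.
POSITIONS = [
    "POS", "SubPOS", "Gender", "Number", "Case",
    "PossGender", "PossNumber", "Person", "Tense",
    "Grade", "Negation", "Voice", "Reserve1", "Reserve2", "Var",
]

def decode_tag(tag):
    """Map a positional tag to {category: value}, skipping unset ('-') positions."""
    if len(tag) != len(POSITIONS):
        raise ValueError(f"expected {len(POSITIONS)} positions, got {len(tag)}")
    return {name: value for name, value in zip(POSITIONS, tag) if value != "-"}

# The slide's example: nejnezajímavějším
for name, value in decode_tag("AAFP3----3N----").items():
    print(f"{name:10s}{value}")
```

A fixed-width tag keeps every category at a known offset, which is why tag patterns with wildcards (as used later for text generation) are easy to write.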
Layer 2 (a-layer): Analytical Syntax
• Dependency + Analytical Function
[Figure: analytical dependency tree with “governor” and “dependent” nodes marked]
The influence of the Mexican crisis on Central and Eastern Europe has apparently been underestimated.
Analytical Syntax: Functions
• Main (for [main] semantic lexemes):
• Pred, Sb, Obj, Adv, Atr, Atv(V), AuxV, Pnom
• “Double” dependency: AtrAdv, AtrObj, AtrAtr
• Special (function words, punctuation,...):
• Reflexives, particles: AuxT, AuxR, AuxO, AuxZ, AuxY
• Prepositions/Conjunctions: AuxP, AuxC
• Punctuation, Graphics: AuxX, AuxS, AuxG, AuxK
• Structural
• Ellipsis: ExD; Coordination etc.: Coord, Apos
Tectogrammatical Annotation
• Underlying (deep) syntax
• 5 sublayers (integrated and/or standoff annotation):
– dependency structure, (detailed) functors
• valency annotation
– topic/focus and deep word order
– coreference (mostly grammatical only)
– discourse
– all the rest (grammatemes):
• detailed functors
• underlying gender, number, mass nouns, ...
• Total: 39 attributes (vs. 5 at m-layer, 2 at a-layer)
Tectogrammatical vs. analytical syntax
In practice, that procedure will require making of certified copies.
• AR: All words
• TR: No function words
[Figure: analytical and tectogrammatical trees of the sentence; callouts: Predicate verb; “Location”; Re-inserted elided actor of “making”]
Dependency Structure
• Similar to the surface (Analytical) layer...
...but:
– certain nodes deleted
• auxiliaries, non-autosemantic words, punctuation
• (some) multiword expressions -> 1 node
– some nodes added
• based on word (mostly verb, noun) valency
• some ellipsis resolution
– detailed dependency relation labels (functors)
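The “certain nodes deleted” step can be illustrated with a toy transformation. This is only a sketch under strong assumptions: the tectogrammatical layer is annotated by its own rules and is not derived this way, the input tree below is a hand-built example rather than treebank data, and the set `DROPPED_AFUNS` is a simplification picked for this illustration. Node addition and functor assignment are not covered.

```python
# Analytical functions treated as function words or punctuation in this sketch.
DROPPED_AFUNS = {"AuxV", "AuxP", "AuxC", "AuxK", "AuxX", "AuxG"}

def drop_function_words(nodes):
    """Remove function-word nodes and reattach their children.

    Each node is a dict with 'id', 'form', 'afun' and 'parent';
    parent 0 stands for the technical root of the tree.
    """
    by_id = {node["id"]: node for node in nodes}

    def nearest_kept(node_id):
        # Climb past dropped nodes until a kept node (or the root) is found.
        while node_id != 0 and by_id[node_id]["afun"] in DROPPED_AFUNS:
            node_id = by_id[node_id]["parent"]
        return node_id

    return [
        {**node, "parent": nearest_kept(node["parent"])}
        for node in nodes
        if node["afun"] not in DROPPED_AFUNS
    ]

# Hand-built analytical tree for "Peter has moved to Iowa ." (illustrative only).
A_TREE = [
    {"id": 1, "form": "Peter", "afun": "Sb",   "parent": 3},
    {"id": 2, "form": "has",   "afun": "AuxV", "parent": 3},
    {"id": 3, "form": "moved", "afun": "Pred", "parent": 0},
    {"id": 4, "form": "to",    "afun": "AuxP", "parent": 3},
    {"id": 5, "form": "Iowa",  "afun": "Adv",  "parent": 4},
    {"id": 6, "form": ".",     "afun": "AuxK", "parent": 0},
]

for node in drop_function_words(A_TREE):
    print(node["form"], "->", node["parent"])
```

In the result, “Iowa” hangs directly under “moved” because the preposition that governed it on the analytical layer has no node of its own.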
Tectogrammatical Functors
[Figure labels: “syntactic”, semantic]
• “Actants”: ACT, PAT, EFF, ADDR, ORIG
– modify: verbs, nouns, adjectives
– cannot repeat in a clause, usually obligatory
• Free modifications (~ 50), semantically defined
– can repeat; optional, sometimes obligatory
– Ex.: LOC, DIR1, ...; TWHEN, TTILL,...; RSTR; BEN, ATT,
ACMP, INTT, MANN; MAT, APP; ID, DPHR, CPHR, ...
• Special
– Coordination, Rhematizers, Foreign phrases (#Forn),...
Deep Word Order, Topic/Focus
• Example: Baker bakes rolls. vs. Baker(IC) bakes rolls. (IC = intonation center)
[Figure: analytical dep. tree]
Deep Word Order, Topic/Focus
• Deep word order:
– from “old” information to the “new” one (left-to-right) at every level (head included)
– projectivity by definition (almost...)
• i.e., partial level-based order -> total d.w.o.
• Topic/focus/contrastive topic
– attribute of every node (t, f, c)
– restricted by d.w.o. and other constraints
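Projectivity can be checked mechanically for any tree whose nodes have a linear order. The sketch below uses a common formulation: an edge is projective if every node lying between the head and the dependent is dominated by the head. The list encoding and the name `is_projective` are assumptions of this illustration; the check is the same whether the order is the surface word order or the deep word order.

```python
def is_projective(heads):
    """heads[i - 1] is the parent of node i (1-based); 0 marks the root."""

    def dominated_by(node, ancestor):
        # Walk up from node; True if ancestor is reached before the root.
        while node != 0:
            if node == ancestor:
                return True
            node = heads[node - 1]
        return False

    for dependent in range(1, len(heads) + 1):
        head = heads[dependent - 1]
        if head == 0:
            continue
        low, high = sorted((head, dependent))
        for between in range(low + 1, high):
            if not dominated_by(between, head):
                return False
    return True

# "Baker bakes rolls": both nouns depend on the verb.
print(is_projective([2, 0, 2]))
# Four nodes with crossing edges 1-3 and 2-4.
print(is_projective([3, 4, 0, 3]))
```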
Coreference
• Grammatical (easy)
– relative clauses
• which, who
– Peter and Paul, who ...
– control
• infinitival constructions
– John promised to go home
– reflexive pronouns
• {him,her,them}self(-ves)
– Mary saw herself in ...
[Figure: tectogrammatical tree for “John promised to go home”; nodes: promise (PRED), John (ACT), go (PAT), he (ACT), home (DIR3)]
Coreference
• Textual
– Ex.: Peter moved to Iowa after he finished his PhD.
[Figure: tectogrammatical tree of the example; nodes: move (PRED), Peter (ACT), finish (TWHEN), Iowa (DIR1), PhD (PAT), he (ACT), he (APP)]
Grammatemes
• Detailed functors (“subfunctors”)
– needed for some functors:
• TWHEN: before/after
• LOC: next-to, behind, in-front-of, ...
• also: ACMP, BEN, CPR, DIR1, DIR2, DIR3, EXT
• Lexical (underlying)
– number (Sg/Pl), tense, modality, degree of
comparison, mass-noun?; is_person_name,
is_dsp_root, ...
VALENCY IN PDT
Prague Dependency Treebank & Valency
• Valency in the PDT
– Valency lexicon for PDT
– General valency lexicon
• Valency in deep vs. surface syntax
– Links between the layers w.r.t. valency
• Valency and word sense
– Sense-disambiguated occurrences:
• Links from data to the lexicon
• Valency in translation, text generation
Definition of Valency
• Ability (“desire”) of words (verbs, nouns,
adjectives) to combine themselves with other
units of meaning
• Properties of valency:
– Specific for every word meaning (in general)
• leave: sb left sth for sb vs. sb left from somewhere
• similar to PropBank leave.02 vs. leave.01
– Typically strongly correlates with surface form (Czech)
• morphological case (~ ending), preposition+case, ...
– Semantic constraints
Structure of Valency
• word (lemma)
– word sense group 1
• valency frame:
– slot1 slot2 slot3
• surface expression
– word sense group 2
• ...
Example: vyměnit (to replace)
– vyměnit1
• valency frame: ACT PAT EFF
• surface expression: Nom. Acc. za+Acc.
– vyměnit2
• ...
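The lemma / sense group / frame / slot hierarchy maps naturally onto a small nested data structure. The following is a sketch only: the class names `Entry`, `Frame`, `Slot` and the helper `surface_of` are invented for this illustration and do not reflect the actual PDT-Vallex file format. Only the first sense of vyměnit is filled in, since the slide elides the second.

```python
from dataclasses import dataclass, field

@dataclass
class Slot:
    functor: str   # e.g. ACT, PAT, EFF
    surface: str   # surface expression, e.g. "Nom." or "za+Acc."

@dataclass
class Frame:
    frame_id: str
    slots: list = field(default_factory=list)

@dataclass
class Entry:
    lemma: str
    gloss: str
    frames: list = field(default_factory=list)

def surface_of(entry, frame_id, functor):
    """Return the surface expression of one slot, or None if it is absent."""
    for frame in entry.frames:
        if frame.frame_id == frame_id:
            for slot in frame.slots:
                if slot.functor == functor:
                    return slot.surface
    return None

ENTRY = Entry(
    lemma="vyměnit",
    gloss="to replace",
    frames=[
        Frame("vyměnit1", [
            Slot("ACT", "Nom."),
            Slot("PAT", "Acc."),
            Slot("EFF", "za+Acc."),
        ]),
    ],
)

print(surface_of(ENTRY, "vyměnit1", "EFF"))
```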
PDT-Vallex Entry
• dosáhnout: “to reach”, “to get [sb to do sth]”
• browser/user-formatted example:
MWEs in PDT-Vallex
• Types included:
– Reflexive particle (se, si)
• smát se – to laugh
• všimnout si – to notice
– Idiomatic constructions
• dosáhnout svého – to achieve one’s goals
• běhá mi mráz po zádech – to give me the shivers
– Light verb constructions (and similar)
• uzavřít dohodu – to agree [on sth], strike an agreement, ...
• vzbuzovat pochybnosti – to doubt, to raise doubts
Corpus ↔ Valency Lexicon
• Corpus: Sentence 2035, Sentence 15345, Sentence 51042
• Lexicon:
ENTRY: uzavřít (to close)
vf1: ACT(.1) CPHR({smlouva}.4)
ex.: u. dohodu (close a contract)
vf2: ACT(.1) PAT(.4)
ex.: u. pokoj (close a room, house)
[Figure: links from corpus sentences to lexicon frames]
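The frame notation of the entry can be split into (functor, form) pairs with a regular expression, and the functors seen on a verb's dependents can then narrow down which frame an occurrence may link to. This is a hedged sketch: in the treebank the links from data to the lexicon are part of the annotation, not computed this way, and the names `parse_frame`, `LEXICON` and `candidate_frames` are hypothetical. The frame strings are copied from the slide as they stand.

```python
import re

def parse_frame(spec):
    """Split 'ACT(.1) PAT(.4)' into [('ACT', '.1'), ('PAT', '.4')]."""
    return re.findall(r"([A-Z][A-Z0-9]*)\(([^)]*)\)", spec)

# Frames of the slide's entry, keyed by frame id.
LEXICON = {
    "uzavřít": {
        "vf1": "ACT(.1) CPHR({smlouva}.4)",
        "vf2": "ACT(.1) PAT(.4)",
    },
}

def candidate_frames(lemma, observed_functors):
    """Return ids of frames whose functor set equals the observed one."""
    observed = set(observed_functors)
    return [
        frame_id
        for frame_id, spec in LEXICON[lemma].items()
        if {functor for functor, _ in parse_frame(spec)} == observed
    ]

print(parse_frame(LEXICON["uzavřít"]["vf1"]))
print(candidate_frames("uzavřít", ["ACT", "PAT"]))
```

Exact set equality is the simplest possible criterion; it ignores optional free modifications and elided arguments, which a real linking procedure would have to handle.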
Valency & Text Generation
• Using valency for...
– ...getting the correct (lemma, tag) of verb arguments
• Example:
VALLEX entry: starat (se) ACT(.1) PAT(o.[.4])
Input tree: starat_se “to take care of” (PRED); Martin (ACT); tygr “tiger” (PAT)
Generated (lemma, tag) pairs:
Martin  ....1..........
se      ...............
starat  V..............
o       ...............
tygr    ....4..........
Result: Martin se stará o tygry. “Martin takes care of tigers.”
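The step from a valency frame to (lemma, tag) pairs can be sketched as below. Everything here is a simplification made for illustration: the function names are hypothetical, the form notation (".1", "o.[.4]") is read exactly as printed on the slide, the word order is hard-coded, and number or agreement (which yields the plural “tygry”) is not modeled. A morphological generator would still be needed to turn the pairs into word forms.

```python
import re

TAG_LENGTH = 15  # length of the positional tag patterns shown on the slide

def tag_pattern(pos=".", case="."):
    """Build a tag pattern; '.' means the position is not yet specified."""
    chars = ["."] * TAG_LENGTH
    chars[0] = pos    # position 1: part of speech
    chars[4] = case   # position 5: morphological case
    return "".join(chars)

def realize_argument(lemma, form):
    """'.4' -> noun in case 4; 'o.[.4]' -> preposition 'o' + noun in case 4."""
    match = re.fullmatch(r"(?:(\w+)\.\[)?\.(\d)\]?", form)
    if match is None:
        raise ValueError(f"unsupported form spec: {form!r}")
    preposition, case = match.groups()
    pairs = []
    if preposition:
        pairs.append((preposition, tag_pattern()))
    pairs.append((lemma, tag_pattern(case=case)))
    return pairs

def plan_clause(verb, reflexive, frame, fillers):
    """Order used here: ACT, reflexive particle, verb, remaining arguments."""
    plan = realize_argument(fillers["ACT"], frame["ACT"])
    if reflexive:
        plan.append((reflexive, tag_pattern()))
    plan.append((verb, tag_pattern(pos="V")))
    for functor, form in frame.items():
        if functor != "ACT":
            plan.extend(realize_argument(fillers[functor], form))
    return plan

FRAME = {"ACT": ".1", "PAT": "o.[.4]"}
PLAN = plan_clause("starat", "se", FRAME, {"ACT": "Martin", "PAT": "tygr"})
for lemma, tag in PLAN:
    print(f"{lemma:8s}{tag}")
```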
PARALLEL TREEBANK CZ-EN
Parallel Czech-English Annotation
• English text → Czech text (human translation)
• Czech side (goal): all layers manual annotation
• English side (goal):
– Morphology and surface syntax: technical conversion
• Penn Treebank style -> PDT Analytic layer
– Tectogrammatical annotation: manual annotation
• (Slightly) different rules needed for English
• Alignment
– Natural, sentence level only (now)
English Annotation: POS and Syntax
• Automatic conversion from Penn Treebank
– PDT morphological layer
• From POS tags
– PDT analytic layer
• From:
– Penn Treebank Syntactic Structure
– Non-terminal labels
– Function tags (non-terminal “suffixes”)
• 2-step process
– Head determination rules
– Conversion to dependency + analytic function
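The 2-step process (head determination, then conversion to dependencies) can be sketched on a small constituency tree. The head table below is a hypothetical, heavily reduced stand-in and not the rule set actually used for the conversion; function tags are only stripped, and the assignment of analytic functions is left out.

```python
# Hypothetical head table: for each phrase label, child labels in priority order.
HEAD_RULES = {
    "S": ["VP"],
    "VP": ["VP", "VB", "VBD", "VBZ", "MD"],
    "NP": ["NN", "NNS", "NNP", "NP"],
    "PP": ["IN"],
}

def base(label):
    """Strip function tags: 'NP-SBJ' -> 'NP'."""
    return label.split("-")[0]

def head_child_index(label, children):
    """Step 1: pick the head child (rightmost match, else rightmost child)."""
    for wanted in HEAD_RULES.get(base(label), []):
        for i in range(len(children) - 1, -1, -1):
            if base(children[i][0]) == wanted:
                return i
    return len(children) - 1

def to_dependencies(tree):
    """Step 2: attach the lexical head of each non-head child to the head's."""
    words, heads = [], {}

    def visit(node):
        label, children = node
        if isinstance(children, str):      # preterminal: (tag, word)
            words.append(children)
            return len(words)              # 1-based word index
        child_heads = [visit(child) for child in children]
        h = head_child_index(label, children)
        for i, child_head in enumerate(child_heads):
            if i != h:
                heads[child_head] = child_heads[h]
        return child_heads[h]

    heads[visit(tree)] = 0                 # the sentence head hangs on the root
    return words, [heads[i] for i in range(1, len(words) + 1)]

TREE = ("S", [
    ("NP-SBJ", [("NNP", "Pierre"), ("NNP", "Vinken")]),
    ("VP", [("MD", "will"),
            ("VP", [("VB", "join"),
                    ("NP", [("DT", "the"), ("NN", "board")])])]),
])

WORDS, HEADS = to_dependencies(TREE)
for index, (word, head) in enumerate(zip(WORDS, HEADS), start=1):
    print(index, word, head)
```

Listing VP before MD in the VP rule makes the modal depend on the main verb, which matches the treatment of auxiliaries (AuxV) on the analytical layer.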
Czech-English Example
Dicku Darmane, zavolejte do své kanceláře!
Dick Darman, call your office!
SUMMARY OF PART 1/1
PDT Treebanks at UFAL (written language)
• Czech
– Prague Dependency Treebank
• Complex annotation, all levels, additional annotation
– Translation of Penn Treebank
• Tectogrammatical layer only, no t/f
– Analytical, morphology: automatic tool
• English
– Re-annotation of Penn Treebank
• Other languages
– Arabic (own annotation)
– Other: by conversion (HamleDT – 30 treebanks)
Prague Dependency Treebanks
• Annotation:
– 4 layers:
• Words, lemmas/tags, surface dep. syntax,
tectogrammatics
– Tectogrammatical layer:
• No function words, semantic relations
• Valency/verb arguments (some MWE features)
– Separate valency lexicon, fully linked from PDT nodes
• Coreference, Topic/focus, Discourse
• Links back to analytical layer (parsing!)
Pointers
• PDT 2.0 (the “Original”), newest version: PDT 3.0
– http://ufal.mff.cuni.cz/pdt2.0
– http://ufal.mff.cuni.cz/pdt3.0
• PCEDT
– http://ufal.mff.cuni.cz/pcedt2.0/
• PEDT
– English side of PCEDT, additional: NE, coreference
– http://ufal.mff.cuni.cz/pedt2.0/
• PADT (Arabic, morphology + surface syntax)
– http://ufal.mff.cuni.cz/padt
• Other corpora, PDT-Vallex, EngVallex:
– Search at http://lindat.cz
• LDC catalog numbers:
– LDC2006T01 (PDT 2.0), LDC2004T23 (PADT 1.0), LDC2004T25 (PEDT 1.0)
• CoNLL 2009 shared task (7 languages, surface syntax + predicate arguments only)
– http://ufal.mff.cuni.cz/conll2009-st
• HamleDT 2.0 (30 treebanks in unified format)
– http://ufal.mff.cuni.cz/hamledt