Prague Arabic Dependency Treebank

Prague Arabic Dependency Treebank: Development in Data and Tools
Faculty of Mathematics and Physics
Faculty of Philosophy and Arts
Charles University in Prague
Jan Hajič
Otakar Smrž
Petr Zemánek
Jan Šnaidauf
Emanuel Beška
Project Release – PADT 1.0
▪ December 2004, Linguistic Data Consortium
▪ 148 000 Morpho, 113 500 Syntax
AFP   13 000   N/A      France Presse    Penn ATB 1
UMH   38 500   N/A      Ummah Press      Penn ATB 2
XIN   13 500   N/A      Xinhua News      A Gigaword
ALH   10 000   73 500   Al-Hayat News    A Gigaword
ANN   12 500   25 500   An-Nahar News    A Gigaword
XIA   26 500   49 500   Xinhua News      A Gigaword
September 23, 2004
Prague Arabic Dependency Treebank:
Development in Data and Tools
Open-Source Tools
➢ TrEd Tree Editor
  ▪ Multi-purpose annotation environment
  ▪ Suite of programming utilities
➢ Netgraph Search Engine
  ▪ Server/Client system architecture
  ▪ Easy-to-learn query language
➢ Encode::Arabic Perl Module
  ▪ Extension for processing of Arabic script
  ▪ ArabTeX, Buckwalter, Unicode, …
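To illustrate what conversion between the Buckwalter notation and Unicode involves, here is a minimal Python sketch. It is not the Encode::Arabic API (that module is written in Perl); the function names `decode_buckwalter` and `encode_buckwalter` are hypothetical, and the table covers only the standard Buckwalter symbols, assuming a plain one-to-one character mapping.

```python
# Minimal sketch of Buckwalter <-> Unicode conversion (illustration only;
# this is NOT the Encode::Arabic interface, and the names are hypothetical).

# Buckwalter symbols listed in Unicode order for two contiguous ranges.
_LETTERS = "'|>&<}AbptvjHxd*rzs$SDTZEg"  # U+0621 (hamza) .. U+063A (ghain)
_MARKS = "_fqklmnhwYyFNKaui~o"           # U+0640 (tatweel) .. U+0652 (sukun)

BW_TO_UNICODE = {c: chr(0x0621 + i) for i, c in enumerate(_LETTERS)}
BW_TO_UNICODE.update({c: chr(0x0640 + i) for i, c in enumerate(_MARKS)})
BW_TO_UNICODE["`"] = "\u0670"  # superscript alif
BW_TO_UNICODE["{"] = "\u0671"  # alif wasla

UNICODE_TO_BW = {u: c for c, u in BW_TO_UNICODE.items()}


def decode_buckwalter(text: str) -> str:
    """Map Buckwalter symbols to Arabic script; other characters pass through."""
    return "".join(BW_TO_UNICODE.get(ch, ch) for ch in text)


def encode_buckwalter(text: str) -> str:
    """Map Arabic script back to Buckwalter symbols; other characters pass through."""
    return "".join(UNICODE_TO_BW.get(ch, ch) for ch in text)


if __name__ == "__main__":
    word = "kitAb"  # 'book' in strict Buckwalter notation
    arabic = decode_buckwalter(word)
    print(arabic, encode_buckwalter(arabic))
```

Because the mapping is one-to-one, encoding a decoded string returns the original input. The transliterations used on the later slides (for example `kabIrun` with a capital `I` for a long vowel) follow a richer notation and would need a different table.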
PADT Functional Views
➢ Functional Generative Description
  ▪ Theory of linguistic meaning and its expression
  ▪ Prague Dependency Treebank for Czech
➢ Independence of representation levels
  ▪ Tectogrammatical – linguistic meaning
  ▪ Analytical – surface dependency syntax
  ▪ Morphological – categories and lexical units
➢ Abstraction of the relations across levels
  ▪ Strict distinction between form and function
  ▪ Different units of description on each level
Functional Morphology
➢ Provides syntax levels with their abstract language, not just giving letters in tokens
  ▪ Revives multiple senses of categories
  ▪ Completeness of generation
  ▪ Strict modeling of grammatical control
➢ MorphoTrees – ‘human tagging’
  ▪ Successful prototype feature-based tagger
Syntactic Levels of Description
➢ Analytical level
  ▪ Pragmatically motivated, close to surface syntax
  ▪ Every single token resulting from morphological level forms one node
  ▪ Tree-like dependency structure for every sentence
➢ Tectogrammatical level
  ▪ Linguistic (literal) meaning, deep relations, TFA
  ▪ Initial structures transformed from AL
  ▪ Nodes for autosemantic words only
  ▪ Decisive role of valency frames
Logic of Analytical Trees
➢ Concepts of dependency and valency
➢ Reduction: sentence must retain grammatical correctness if leaves (terminal nodes) are chopped off
➢ Trees: clause components → clauses → sentences → paragraphs etc.
  ▪ Subtrees of clauses exchangeable for non-clauses
➢ Nodes: words, tokenized parts of words, punctuation marks – marked by functions
➢ Edges: syntactic relations – governing node → dependent node/subtree
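The node, edge, and reduction notions above can be sketched as a small data structure. This is a minimal illustration in Python, not the PADT data format or the TrEd API: the class `Node` and the helpers `chop_leaves` and `listing` are hypothetical names, and the attachments in the sample tree (built from node labels that appear on the predication slide) are assumed for the sake of the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """One token of the sentence, labelled with its syntactic function."""
    form: str
    function: str
    children: List["Node"] = field(default_factory=list)

    def add(self, child: "Node") -> "Node":
        """Attach a dependent node (edge: governing node -> dependent node)."""
        self.children.append(child)
        return child

    def is_leaf(self) -> bool:
        return not self.children


def chop_leaves(node: Node) -> Node:
    """Return a copy of the tree with all current leaves removed; the root is kept."""
    kept = [chop_leaves(c) for c in node.children if not c.is_leaf()]
    return Node(node.form, node.function, kept)


def listing(node: Node) -> List[str]:
    """Pre-order listing of the tree as 'form [function]' strings."""
    out = [f"{node.form} [{node.function}]"]
    for child in node.children:
        out.extend(listing(child))
    return out


# Assumed attachments: subject and adverbial depend on the predicate,
# the pronoun depends on the subject noun.
root = Node("dAma", "Pred")
subject = root.add(Node("iqtirAHu", "Sb"))
subject.add(Node("-hu", "Atr"))
root.add(Node("sAEatayni", "Adv"))

if __name__ == "__main__":
    reduced = chop_leaves(root)
    print(listing(reduced))               # one reduction step
    print(listing(chop_leaves(reduced)))  # a second step leaves only the predicate
```

Each reduction step removes only the nodes that are leaves at that moment, so the predicate at the root is the last node standing. Whether each intermediate sentence is grammatical is a linguistic judgment that the code does not check.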
Some Syntax Issues of Arabic
➢ Non-verbal predication of several types
  ▪ Subordinate non-verbal clauses / modification
➢ Verb-like behavior of many nominal forms
➢ Mostly VSO in verbal sentences, but…
  ▪ vice-versa in non-verbal clauses
  ▪ different, depending on context boundness
➢ Compound verbs, fixed composite prepositions
➢ Grammatical co-reference, accusative of inner object, complex referencing, etc.
Problem I: Predication
➢ Head node of tree: PREDICATE
  ▪ Why? Steady role in sentence, cannot be omitted
➢ Verbal predicate: I-go to school
➢ Non-verbal predicate
  ▪ Nominal: The-house a-big (=the house is big)
  ▪ Existential: There a-city (=there is a city)
  ▪ Prepositional
    ▪ Possessive: For him a-house (=he has a house)
    ▪ Adverbial: The-mosque in the-city (=…is…)
  ▪ Conjunctional: The-problem that (=…is that)
Predication Types in Trees
[Figure: analytical trees for each predication type; nodes are listed per tree as form [function] gloss]
Verbal, with verb-like behavior of a noun (object of noun?)
  dAma [Pred] lasted
  iqtirAHu [Sb] proposal
  -hu [Atr] his
  al-EamalIyata [Obj] the-operation [acc.]
  EalA [AuxP] on
  zumalA’i [Obj] colleagues
  -hi [Atr] his
  sAEatayni [Adv] two-hours [acc.]
Nominal
  al-baytu [Sb] the-house [nom.]
  kabIrun [Pnom] a-big [nom.]
Existential
  vam~ata [PredE] there-is
  madInatun [Sb] a-city [nom.]
Prepositional (possessive)
  la- [PredP] for
  -hu [Obj] him
  baytun [Sb] a-house [nom.]
Prepositional (adverbial, locative)
  al-jAmiEu [Sb] the-mosque [nom.]
  fI [PredP] in
  al-madInati [Adv] the-city [gen.]
Problem II: Clauses & Co-reference
➢ Recursiveness: subordinate clause is contained as subtree in place of simple element
  ▪ Head-node of clause gets the same function
  ▪ Problem: non-verbal structures – clauses or not?
  ▪ Compound verbs (mA zAla etc.) treated equally
➢ Grammatical co-reference: Personal pronoun formally required by another element
  ▪ Pronoun must be marked to be treated as such
  ▪ Target of reference is unambiguously identifiable
  ▪ Often in subordinate clauses, mostly attributive
  Ex.: He-wrote a-book number its-pages hundred
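The claim that the target of a marked referencing pronoun can be identified unambiguously can be sketched in code. The following Python example is only an illustration under stated assumptions: it assumes that a referencing pronoun carries a function ending in `_Ref`, that the head of an attributive clause carries a function starting with `Atr_`, and that the target is the node governing that clause head. The tree attachments, the `Node` class, and `resolve_reference` are hypothetical and are not taken from the PADT tools.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)  # identity-based equality; avoids recursive comparisons
class Node:
    form: str
    function: str
    parent: Optional["Node"] = field(default=None, repr=False)
    children: List["Node"] = field(default_factory=list)

    def add(self, form: str, function: str) -> "Node":
        """Create a dependent node governed by this node and return it."""
        child = Node(form, function, parent=self)
        self.children.append(child)
        return child


def resolve_reference(pronoun: Node) -> Optional[Node]:
    """Assumed heuristic: climb from a *_Ref pronoun to the head of the
    enclosing attributive clause (function 'Atr_*'); the node governing
    that clause head is taken as the target of reference."""
    if not pronoun.function.endswith("_Ref"):
        return None
    node = pronoun.parent
    while node is not None and not node.function.startswith("Atr_"):
        node = node.parent
    return node.parent if node is not None else None


# 'The man wrote a book in which there are a hundred pages' (assumed attachments)
wrote = Node("kataba", "Pred")
wrote.add("al-rajulu", "Sb")
book = wrote.add("kitAban", "Obj")
clause = book.add("fI", "Atr_PredP")     # attributive clause, prepositional predicate
pronoun = clause.add("-hi", "Adv_Ref")   # referencing pronoun, adverbial in clause
hundred = clause.add("mi'atu", "Sb")
hundred.add("SafHatin", "Atr")

if __name__ == "__main__":
    target = resolve_reference(pronoun)
    print(target.form if target else None)
```

The design choice here is to rely only on the function labels: once the pronoun is marked, no agreement checking or search over candidate antecedents is needed, which mirrors the point that the pronoun must be marked to be treated as such.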
Clauses & Co-reference in Trees
[Figure: two analytical trees; nodes are listed per tree as form [function] gloss, with the slide's callouts in parentheses]
Tree 1
  mA [AuxM] not
  zAlat [Pred] she-stopped (compound verb, formed as main verb and its complement)
  zaynabu [Sb] Zaynab
  tuHis~u [Atv] she-feels
  anna [AuxC] that
  jumalan [Sb] sentences [acc.]
  naHwu [Sb] grammar [nom.]
  -hA [Atr_Ref] their (referencing pronoun, as attribute in clause)
  wADiHun [Atr_Pnom] clear [nom.] (attributive clause, nominal predicate)
  tuEjibu [Obj_Pred] they-impress (objective clause, verbal predicate)
  -hA [Obj] her
Tree 2
  kataba [Pred] he-wrote
  al-rajulu [Sb] the-man [nom.]
  kitAban [Obj] a-book
  fI [Atr_PredP] in (attributive clause, prepositional predicate (adverbial))
  -hi [Adv_Ref] it (referencing pronoun, as adverbial in clause)
  mi’atu [Sb] hundred [nom.]
  SafHatin [Atr] pages [gen.]
Future Prospects
▪ Implementation of Functional Morphology
▪ Tectogrammatical annotation
▪ Lexicons of valency frames
▪ Re-training the feature-based tagger on MorphoTrees
▪ Machine-learning on the treebank data for various purposes
Thank you
Questions welcome!
http://ckl.mff.cuni.cz/padt/