
Machine Translation
using Tectogrammatics
Zdeněk Žabokrtský
IFAL, Charles University in Prague
Overview
Part I - theoretical background
Part II - TectoMT system
MT pyramid (in terms of PDT)
Key question in MT: optimal level of abstraction?
MT triangle:
[Diagram: levels of abstraction from raw text through morpho., surf.synt., and tectogram. up to interlingua; the "transfer distance" between source and target language shrinks as the level of abstraction rises; at which level should transfer happen?]
Our answer: somewhere around tectogrammatics
high generalization over different language characteristics, but
still computationally (and mentally!) tractable
Basic facts about "Tecto"
introduced by Petr Sgall in the 1960s
implemented in Prague Dep. Treebank 2.0
each sentence represented as a
deep-syntactic dependency tree
functional words accompanying an
autosemantic word "collapse" with it into a
single t-node, labeled with the autosemantic
t-lemma
added t-nodes (e.g. because of pro-drop)
semantically indispensable syntactic and
morphological categories rendered by a
complex system of t-node attributes
(functors+subfunctors, grammatemes for
tense, number, degree of comparison, etc.)
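To make the representation concrete, a t-node can be pictured as a small record of these attributes. The sketch below is illustrative Python, with attribute names only loosely following PDT 2.0 terminology; the functor value shown is a guess for the example phrase, not taken from the treebank.

```python
from dataclasses import dataclass, field

# Schematic t-node record; attribute names loosely follow PDT 2.0
# terminology, but this is an illustration, not the PDT data format.
@dataclass
class TNode:
    t_lemma: str                      # lemma of the autosemantic word
    functor: str                      # dependency relation label, e.g. "PAT"
    grammatemes: dict = field(default_factory=dict)   # tense, number, degcmp...
    children: list = field(default_factory=list)

# "elections to an assembly": the preposition "to" collapses into the
# dependent noun's t-node rather than getting a t-node of its own
assembly = TNode("assembly", "DIR3", {"number": "sg"})
elections = TNode("election", "PAT", {"number": "pl"}, [assembly])
```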
SMT and limits of growth
current state-of-the-art approaches to MT
n-grams + large parallel (and also monolingual) corpora +
huuuuge computational power
n-grams are very greedy!
availability (or even existence!) of more data?
example: Czech-English parallel data
~1 MW - easy (just download and align some tens of e-books)
~10 MW - doable (parallel corpus Czeng)
~100 MW - not now, but maybe in a couple of years...
~1 GW - ?
~10 GW (~ 100 000 books) - Was it ever translated???
How could tecto help SMT?
n-gram view:
manifestations of lexemes are mixed with manifestations of
language means expressing the relations between the lexemes and
of other grammar rules
inflectional endings, agglutinative affixes, functional words, word
order, punctuation, orthographic rules ...
It will be delivered to Mr. Green's assistants at the nearest meeting.
⇒ training data sparsity
how could tecto ideas help?
within each sentence, clear separation of meaningful "signs" from
"signs" which are only imposed by grammar (e.g. imposed by
agreement)
clear separation of lexical, syntactical and morphological meaning
components
⇒ modularization of the translation task ⇒ potential for a
better structuring of statistical models ⇒ more effective
exploitation of the limited training data
"Semitecto"
abstract sentence representation, tailored for MT purposes
motivation:
not to make decisions which are not really necessary for the MT
process (such as distinguishing between many types of temporal and
directional semantic complementations)
given the target-language "semitecto" tree, we want the sentence
generation to be deterministic
slightly "below" tecto (w.r.t. the abstraction axis):
adopting the idea of separating lexical, syntactical and morphological
meaning components; adopting the t-tree topology principles
adopting many t-node attributes (especially grammatemes,
coreference, etc.)
but (almost) no functors, no subfunctors, no WSD, no pointers to
valency dictionary, no tfa...
closer to the surface-syntax
main innovation: concept of formemes
Formemes
formeme = morphosyntactic language means expressing
the dependency relation
n:v+6 (in Czech) = semantic noun which is on the surface expressed in
the form of prepositional group in locative with preposition "v"
v:that+fin/a (in English) = semantic verb expressed in active voice as a
head of subordinating clause introduced with the sub.conjunction "that"
obviously, the sets of formeme values are specific to each of the
four semantic parts of speech
in fact, formemes are edge labels partially substituting functors
what is NOT captured by formemes:
morphological categories imposed by grammar rules (esp. by
agreement), such as gender, number and case for adjectives in
attributive positions
morphological categories already represented by
grammatemes, such as degree of comparison for adjectives,
tense for verbs, number for nouns
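Formeme strings like these have a regular internal structure, so they can be decomposed mechanically into semantic part of speech, form, function word, and voice. The parser below is a hypothetical sketch of that decomposition, not part of PDT or TectoMT.

```python
# Hypothetical parser for formeme strings such as "n:v+6" or "v:that+fin/a".
# Format assumed here: <sempos>:<form>, where a verbal form may carry a
# "/a" (active) or "/p" (passive) voice suffix, and a function word or
# morphological case may be joined with "+".
def parse_formeme(formeme):
    sempos, _, form = formeme.partition(":")
    voice = None
    if sempos == "v" and "/" in form:
        form, _, voice = form.partition("/")
    funcword = None
    if "+" in form:
        funcword, _, form = form.partition("+")
    return {"sempos": sempos, "form": form,
            "funcword": funcword, "voice": voice}
```

For example, `parse_formeme("n:v+6")` yields `{'sempos': 'n', 'form': '6', 'funcword': 'v', 'voice': None}`: a noun, governed by the preposition "v", in case 6 (locative).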
Formemes in the tree
Example: It is extremely important that Iraq held elections to
a constitutional assembly.
Some more examples
of proposed formemes
Czech
968 adj:attr
604 n:1
552 n:2
497 v:fin/a
308 n:4
260 adv:
169 n:v+6
133 adj:compl
117 v:inf
104 n:poss
86 n:7
82 v:že+fin/a
77 v:rc/a
63 n:s+7
53 n:k+3
53 n:attr
50 n:na+6
47 n:na+4
42 v:aby+fin/a
English
661 adj:attr
568 n:attr
456 n:subj
413 n:obj
370 v:fin/a
273 n:of+X
238 adv:
160 n:poss
160 n:in+X
146 v:to+inf/a
92 adj:compl
91 n:to+X
...
62 v:rc/a
...
51 v:that+fin/a
...
39 v:ger/a
Three-way transfer
translation process:
(I have been asked by him to come -> Požádal mě, abych přišel)
1. source language sentence analysis up to the
"semitecto" layer
2. transfer of
lexemes (ask → požádat, come → přijít)
formemes (v:fin/p → v:fin/a, v:to+inf → v:aby+fin/a)
grammatemes (tense=past → tense=past, 0 → verbmod=cdn)
3. target language sentence synthesis from the
"semitecto" layer
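The transfer step above can be sketched as three independent lookups over a node's attributes. The node layout and the mapping tables below are invented for illustration; in particular, "0 → verbmod=cdn" is read here as adding a conditional-mood grammateme that has no counterpart on the source side.

```python
# Toy sketch of the three-way transfer: each "semitecto" node carries a
# lexeme, a formeme, and grammatemes, and each part is mapped separately.
LEXEME_MAP = {"ask": "požádat", "come": "přijít"}            # illustrative entries
FORMEME_MAP = {"v:fin/p": "v:fin/a", "v:to+inf": "v:aby+fin/a"}
GRAMMATEME_MAP = {("tense", "past"): ("tense", "past"),
                  ("verbmod", None): ("verbmod", "cdn")}      # 0 -> verbmod=cdn

def transfer_node(node):
    out = dict(node)
    out["lexeme"] = LEXEME_MAP.get(node["lexeme"], node["lexeme"])
    out["formeme"] = FORMEME_MAP.get(node["formeme"], node["formeme"])
    out["grammatemes"] = dict(
        GRAMMATEME_MAP.get(item, item) for item in node["grammatemes"].items()
    )
    return out

src = {"lexeme": "come", "formeme": "v:to+inf",
       "grammatemes": {"tense": "past", "verbmod": None}}
trg = transfer_node(src)   # lexeme "přijít", formeme "v:aby+fin/a", verbmod "cdn"
```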
Adding statistics...
translation model (e.g. trained on the parallel corpus Czeng, 30 MW): P(l_T | l_S), P(f_T | f_S)
"binode" language model (e.g. trained on the partially parsed Czech National Corpus, 100 MW): P(l_gov, l_dep, f)
(S = source language, T = target language)
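These factors compose multiplicatively when scoring a candidate target lemma/formeme for an edge: P(l_T | l_S) · P(f_T | f_S) · P(l_gov, l_dep, f). The sketch below uses made-up probability tables purely to show that factorization; real tables would be estimated from the corpora named above.

```python
import math

# Made-up probability tables, for illustration only.
P_LEMMA = {("ask", "požádat"): 0.6, ("ask", "zeptat_se"): 0.3}
P_FORMEME = {("v:fin/p", "v:fin/a"): 0.7, ("v:fin/p", "v:fin/p"): 0.2}
P_BINODE = {("požádat", "on", "n:1"): 0.05, ("zeptat_se", "on", "n:1"): 0.02}

def score(src_lemma, src_formeme, trg_lemma, trg_formeme, gov, dep, f):
    # log-space product of the two translation factors and the binode LM factor
    factors = [P_LEMMA.get((src_lemma, trg_lemma), 1e-9),
               P_FORMEME.get((src_formeme, trg_formeme), 1e-9),
               P_BINODE.get((gov, dep, f), 1e-9)]
    return sum(math.log(p) for p in factors)

candidates = ["požádat", "zeptat_se"]
best = max(candidates,
           key=lambda l: score("ask", "v:fin/p", l, "v:fin/a", l, "on", "n:1"))
```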
Part II
TectoMT System
Goals
primary goal
to build a high-quality linguistically motivated MT
system using the PDT layered framework, starting
with English -> Czech direction
secondary goals
to create a system for testing the true usefulness of
various NLP tools within a real-life application
to exploit the abstraction power of tectogrammatics
to supply data and technology for other projects
Main design decisions
Linux + Perl
set of well-defined, linguistically relevant levels of
language representation
neutral w.r.t. chosen methodology (e.g. rules vs.
statistics)
in-house OO architecture as the backbone, but easy
incorporation of external tools (parsers, taggers,
lemmatizers etc.)
accent on modularity: translation
scenario as a sequence of
translation blocks (modules
corresponding to individual
NLP subtasks)
MT triangle:
[Diagram repeated from Part I: raw text, morpho., surf.synt., tectogram., interlingua between source and target language]
TectoMT - Example of analysis (1)
Sample sentence: It is extremely important that
Iraq held elections to a constitutional assembly.
TectoMT - example of analysis (2)
phrase-structure tree:
TectoMT - example of analysis (3)
analytical tree
TectoMT - example of analysis (4)
tectogrammatical tree (with formemes)
Heuristic
alignment
Sentence pair:
It is extremely important that
Iraq held elections to a
constitutional assembly.
Je nesmírně důležité, že v
Iráku proběhly volby do
ústavního shromáždění.
Formeme pairs extracted from parallel aligned trees (count, Czech formeme ↔ English formeme):
593 adj:attr ↔ adj:attr
290 v:fin/a ↔ v:fin/a
282 n:1 ↔ n:subj
214 adj:attr ↔ n:attr
165 n:2 ↔ n:of+X
152 adv: ↔ adv:
149 n:4 ↔ n:obj
102 n:2 ↔ n:attr
86 n:v+6 ↔ n:in+X
79 n:poss ↔ n:poss
73 n:1 ↔ n:obj
61 n:2 ↔ n:obj
51 v:inf ↔ v:to+inf/a
50 adj:compl ↔ adj:compl
39 n:2 ↔ n:
34 n:4 ↔ n:subj
34 n:attr ↔ n:attr
32 v:že+fin/a ↔ v:that+fin/a
32 n:2 ↔ n:poss
27 n:4 ↔ n:attr
27 n:2 ↔ n:subj
26 adj:attr ↔ n:poss
25 v:rc/a ↔ v:rc/a
20 v:aby+fin/a ↔ v:to+inf/a
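Pair counts like these yield the formeme translation model P(f_T | f_S) by normalizing per source-language formeme. The snippet below does this for a small illustrative subset of the pairs, assuming the English formeme is the source side (en → cs direction).

```python
from collections import defaultdict

# (english_formeme, czech_formeme) -> count; illustrative subset of the
# pairs extracted from the aligned trees
pair_counts = {
    ("adj:attr", "adj:attr"): 593,
    ("n:attr", "adj:attr"): 214,
    ("n:subj", "n:1"): 282,
    ("n:subj", "n:4"): 34,
    ("n:obj", "n:4"): 149,
    ("v:that+fin/a", "v:že+fin/a"): 32,
}

# total count per English (source) formeme
totals = defaultdict(int)
for (en, cs), c in pair_counts.items():
    totals[en] += c

# P(cs_formeme | en_formeme), the direction needed for en -> cs transfer
p_formeme = {(en, cs): c / totals[en] for (en, cs), c in pair_counts.items()}
```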
Processing blocks in the current prototype
[Diagram: blocks grouped by layer along the path input text → src-m-layer → src-p-layer → src-a-layer → src-t-layer → trg-t-layer → output text]
1) segment the input text into sentences
2) tokenize the sentences
3) morphological tagging
4) lemmatize each token
5) phrase-structure parsing
6) mark phrase heads
7) run the constituency-to-dependency transformation
8) mark subject nodes
9) derive the t-tree topology
10) label t-nodes with t-lemmas
11) assign coordination/apposition functors
12) mark finite clauses
13) detect grammatical coreference in relative clauses
14) determine the semantic part of speech
15) fill grammateme attributes (number, tense, degree...)
16) detect the sentence modality
17) detect formemes
18) clone the source-language t-tree
19) translate t-lemmas using a simple 1:1 probabilistic lexicon
20) set the gender attribute according to the noun lemma
21) set the aspect attribute according to the verb lemma
22) predict the target-language formeme
23) resolve morphological agreement
24) expand complex verb forms
25) add prepositions and conjunctions
26) perform conjugation and declension
27) resolve word order
28) add punctuation
29) perform vocalization of prepositions
30) concatenate the tokens into the final sentence string
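The blocks-in-a-scenario design can be sketched as an ordered list of callables that successively enrich a shared document. The class and block bodies below are schematic Python, not TectoMT's actual Perl API.

```python
# Schematic sketch of a TectoMT-style scenario: each block is a small
# module that reads and enriches a shared document structure, and a
# scenario is just an ordered sequence of such blocks.
class Scenario:
    def __init__(self, blocks):
        self.blocks = blocks

    def apply(self, document):
        for block in self.blocks:
            block(document)
        return document

def segment_sentences(doc):        # cf. block 1, naive period splitting
    doc["sentences"] = [s.strip() + "."
                        for s in doc["text"].split(".") if s.strip()]

def tokenize(doc):                 # cf. block 2, naive whitespace splitting
    doc["tokens"] = [s.rstrip(".").split() + ["."] for s in doc["sentences"]]

scenario = Scenario([segment_sentences, tokenize])
doc = scenario.apply({"text": "It is important. Iraq held elections."})
```

Keeping each NLP subtask in its own block is what lets a single block (say, the parser) be swapped out or tested in isolation, as the design goals above require.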
Thank you !