Transcript Slide 1

Disambiguation of homographic adjective
and adverb forms in Croatian texts
Danijela Merkler*, Daša Berović*, Željko Agić**
* Department of Linguistics
** Department of Information Sciences
Faculty of Humanities and Social Sciences, University of Zagreb
NooJ 2011
Dubrovnik
2011-06-15
Talk overview
 project ACCURAT
 problem and corpora
 modeling local grammars and applying them
 statistical evaluation
ACCURAT
 FP7 project
 main goal: to develop methods and techniques to
overcome one of the central problems of machine
translation, the lack of linguistic resources for
under-resourced areas of machine translation
 key innovation: a methodology and tools to
measure, find and use comparable corpora to
improve the quality of MT
 the ACCURAT project will significantly contribute not
only to the theory of MT, but also to corpus linguistics,
information extraction and natural language
processing in general
Scientific objectives
 create comparability metrics – to develop the
methodology and determine criteria to measure the
comparability of source and target language
documents in comparable corpora
 establish research methods for alignment and
extraction of lexical, terminological and other
linguistic data from comparable corpora
 disambiguation: an important process for POS and MSD
tagging
Problem
 parallel and comparable resources are sparse for Croatian
when paired with any of the languages included in the
project, especially if the other language is under-resourced
as well
 importance of high-quality annotation of existing language
resources for Croatian
 building (factored) language models for MT
 using text anchors in comparable resources
 MSD-tagging and lemmatization errors detected in existing
Croatian language resources
 e.g. Croatian National Corpus v2.5 (automatically lemmatized
and MSD-tagged), manually annotated subcorpora, Croatian
Dependency Treebank
 manual analysis of their annotation reveals regular patterns in
these errors
Problem
 the nominative singular neuter forms of descriptive adjectives
are identical to the forms of the adverbs derived from those
adjectives by suffixation
 these adverbs are realized in context
 in most cases the adverb is derived from an adjective that has
an abstract meaning
 there are several types of word forms
 the forms of adverbs and adjectives that occur with
no semantic constraints: razdragano (gleeful), bahato (arrogant),
ubrzano (rapidly), uzrujano (upset), umiljato (cuddly)
 forms that are made from verbs: drhtavo (shaking), laskavo
(flattering), šepavo (lame)
 forms that have dual meaning (concrete and abstract): mlako
(lukewarm), šugavo (itchy), mračno (darkly), hladno (cold), gorko
(bitter)
 forms that denote spatial and temporal relations: rano (early),
duboko (deeply), plitko (shallow), lijevo (left)
Corpora
 Croatia Weekly
 100 kw newspaper corpus (newspaper published from
1998 to 2000, 118 issues)
 it covers different domains: politics, economy and
finance, tourism, ecology, culture, art, sports
 part of the Croatian side of the Croatian-English Parallel
Corpus, manually lemmatized and MSD-tagged using the
MULTEXT-East v3 morphosyntactic specifications
 1984
 Orwell’s "1984" corpus, manually lemmatized and
MSD-tagged using MULTEXT-East v4
 languages: En, Ro, Sl, Cs, Et, Hu, Sr, Bg, Ru, Mk, Hr...
 encoded in TEI P4 (XML)
Corpora
 imported the corpora to NooJ
 used the NooJ XML import feature
 kept the MSD feature annotations for adjectives,
adverbs, nouns and verbs
 converted the annotations for these PoS from
MULTEXT-East to the NooJ format for lexical resources
 modified feature annotations
 e.g. the MTE verb types auxiliary and copulative were
mapped to PG (auxiliary verb) in NooJ
 this preprocessing let us design the rules without
using Croatian resources for NooJ, i.e. skipping NooJ
linguistic analysis
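The conversion step above can be sketched roughly as follows. This is only an illustration, not the authors' converter: the function name msd_to_label is hypothetical, the output labels simply reuse the notation of the slides (Vpg, V, A, R, N) rather than real NooJ feature strings, and it assumes the MULTEXT-East convention that the first MSD character is the part of speech and the second character of a verb MSD is its type (a = auxiliary, c = copula).

```python
from typing import Optional

# Illustrative labels only: they follow the slide notation (Vpg, V, A, R, N).
# The exact NooJ feature strings used by the authors are not given in the talk.
KEPT_POS = {"N": "N", "A": "A", "R": "R"}  # noun, adjective, adverb
AUX_VERB_TYPES = {"a", "c"}  # assumed MTE verb types: auxiliary, copula


def msd_to_label(msd: str) -> Optional[str]:
    """Return a coarse label for an MTE MSD, or None if the PoS is not kept."""
    if not msd:
        return None
    pos = msd[0]
    if pos == "V":
        verb_type = msd[1] if len(msd) > 1 else ""
        # Auxiliary and copulative verbs collapse into one class (PG).
        return "Vpg" if verb_type in AUX_VERB_TYPES else "V"
    # Only nouns, adjectives and adverbs are kept besides verbs.
    return KEPT_POS.get(pos)
```

Any MSD whose part of speech is not among the four kept categories maps to None, which mirrors the decision to keep annotations only for adjectives, adverbs, nouns and verbs.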
Patterns
 we noticed several types of patterns in which adverbs
that are homographic with adjectives occur
 they are defined by their contextual environment
1) Vpg + A* + V → Vpg + R* + V
2) Vpg + A + A* → Vpg + R + A*
3) A* + V → R* + V
4) A + A* + N → R + A* + N
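The four patterns can be read as context-dependent retagging rules. The sketch below is one possible reading in Python, not the NooJ grammars themselves. It assumes that each symbol, starred or not, stands for exactly one token, that the position to retag is the one whose tag differs between the left and right side of the arrow, and that every token tagged A is already a homographic candidate (a real grammar would restrict this to nominative singular neuter forms). The names Token, RULES and retag are hypothetical, and the example phrases in the usage note are invented for illustration.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (word form, coarse tag)

# (context pattern, index of the tag rewritten from A to R)
RULES = [
    (("Vpg", "A", "V"), 1),  # 1) Vpg + A* + V -> Vpg + R* + V
    (("Vpg", "A", "A"), 1),  # 2) Vpg + A + A* -> Vpg + R + A*
    (("A", "V"), 0),         # 3) A* + V -> R* + V
    (("A", "A", "N"), 0),    # 4) A + A* + N -> R + A* + N
]


def retag(tokens: List[Token]) -> List[Token]:
    """Retag adjectives as adverbs wherever one of the four contexts matches."""
    tags = [tag for _, tag in tokens]
    new_tags = list(tags)
    for i in range(len(tags)):
        for pattern, target in RULES:
            # Match against the original tags so one rewrite cannot feed another.
            if tuple(tags[i:i + len(pattern)]) == pattern:
                new_tags[i + target] = "R"
    return [(word, tag) for (word, _), tag in zip(tokens, new_tags)]
```

For example, retag([("je", "Vpg"), ("hladno", "A"), ("odgovorio", "V")]) retags hladno as R, and in retag([("izrazito", "A"), ("hladno", "A"), ("vrijeme", "N")]) only izrazito becomes R while hladno stays an adjective.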
Vpg + A* + V
Vpg + A + A*
A* + V
A + A* + N
Statistics 1
 manually checked concordances
pattern          cw100    orwell    cw100 + orwell
Vpg + A* + V     64 %     62 %      63 %
Vpg + A + A*     100 %    100 %     100 %
A* + V           82 %     54 %      67 %
A + A* + N       69 %     75 %      70 %
total            77 %     61 %      70 %
 errors frequently involve the word sve, so we
upgraded all grammars so that they no longer recognize sve
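The effect of that upgrade can be illustrated with a minimal Python sketch of the A* + V rule with a lexical exclusion. How the exclusion was actually expressed in the NooJ graphs is not stated in the talk; the names EXCLUDED_FORMS and retag_before_verb, and the two-word test phrases, are assumptions made for illustration.

```python
EXCLUDED_FORMS = {"sve"}  # word forms the upgraded rule must not match


def retag_before_verb(tokens):
    """Apply A* + V -> R* + V, skipping excluded word forms."""
    out = list(tokens)
    for i in range(len(tokens) - 1):
        word, tag = tokens[i]
        next_tag = tokens[i + 1][1]
        if tag == "A" and next_tag == "V" and word.lower() not in EXCLUDED_FORMS:
            out[i] = (word, "R")
    return out
```

With this guard, a sequence such as [("sve", "A"), ("zna", "V")] is left untouched, while [("hladno", "A"), ("odgovara", "V")] is still retagged.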
Example of upgraded grammar
Statistics 2
pattern          cw100    orwell    cw100 + orwell
Vpg + A* + V     100 %    83 %      92 %
Vpg + A + A*     100 %    100 %     100 %
A* + V           87 %     63 %      74 %
A + A* + N       78 %     100 %     82 %
total            89 %     73 %      83 %
 results improved after we applied the upgraded
grammars
 significant difference between the newspaper corpus
and the literary corpus
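Both observations can be checked directly against the two statistics slides. The short Python sketch below only copies the reported percentages and subtracts them; the names BEFORE, AFTER, gains and corpus_gap are introduced here for illustration.

```python
# Percentages copied from the Statistics 1 and Statistics 2 slides
# (columns: cw100, orwell, cw100 + orwell).
BEFORE = {
    "Vpg + A* + V": (64, 62, 63),
    "Vpg + A + A*": (100, 100, 100),
    "A* + V": (82, 54, 67),
    "A + A* + N": (69, 75, 70),
    "total": (77, 61, 70),
}
AFTER = {
    "Vpg + A* + V": (100, 83, 92),
    "Vpg + A + A*": (100, 100, 100),
    "A* + V": (87, 63, 74),
    "A + A* + N": (78, 100, 82),
    "total": (89, 73, 83),
}


def gains(before, after):
    """Change in percentage points per pattern and corpus column."""
    return {k: tuple(a - b for b, a in zip(before[k], after[k])) for k in before}


def corpus_gap(stats):
    """Difference between the cw100 and orwell totals, in percentage points."""
    cw100, orwell, _combined = stats["total"]
    return cw100 - orwell
```

The totals rise by 12, 12 and 13 percentage points for cw100, orwell and the combined corpus, while the gap between the newspaper and the Orwell totals stays at 16 points both before and after the upgrade.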
Future work
 the nominative singular masculine forms of relational
adjectives are identical to the forms of the adverbs
derived from those adjectives by suffixation (junački,
pučki, bratski, životinjski)
 disambiguation of these forms also depends
on the grammatical context in which they occur, so it
can also be done in a similar way
 applying the disambiguation rules to other Croatian
language resources
Thank you for your attention.
www.accurat-project.eu
The research within the ACCURAT project leading to
these results has received funding from the European
Union Seventh Framework Programme (FP7/2007-2013),
grant agreement no. 248347.