PolUKR - domeczek

Principles of organizing
a common morphological tagset and
a search engine for PolUKR
(Polish-Ukrainian Parallel Corpus)
Польсько-Український паралельний корпус
Polsko-Ukraiński Korpus Równoległy
http://corpus.domeczek.pl
Natalia Kotsyba, Institute of Slavic Studies, Polish Academy of Sciences
Olga Shypnivska, ULIF, Ukrainian Academy of Sciences
Magdalena Turska, Warsaw University
Main objectives and expected applications
• at least 3 mln tokens; representative
• sentence-level alignment
• morphological annotation with a common tagset
• public access; user-friendly
• linguistic material for
– (independent) language learning
– bilingual dictionaries
– research on grammar and lexis
• translation memory for humans and machines
Statistics (prototype version)
             total        Polish part   Ukrainian part
texts        70           35            35
tokens       359 926      179 087       180 120
characters   3 863 564    1 449 376     2 407 034
Kb           3941         1492          2439
Search (present)
• based on Perl regular expressions
• any search string has to be enclosed in “/”, e.g. /Холодна війна/ (Cold War)
• special characters:
| alternative; ( and ) beginning and end of a subexpression
[ and ] beginning and end of a defined character class
? 0 or 1 occurrences; * 0 or more occurrences
+ 1 or more occurrences
\s any whitespace character
\w any letter, digit or underscore
\b word boundary; \ escape
Examples of search formulae
/jako/ → „jako”
/jako\s/ → „jako, niejako, dwojako”
/\bjako/ → „jakość”
/norma\./ → „norma” before a full stop
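The search formulae above can be tried out with a small sketch. It uses Python's re module as a stand-in for the Perl engine (the two agree on the constructs listed here). The helper names strip_delimiters and matching_words and the sample sentence are illustrative assumptions, not part of PolUKR. Note that an unanchored pattern such as /jako/ also matches inside longer words.

```python
import re


def strip_delimiters(query: str) -> str:
    """Return the pattern between the enclosing slashes of a query like /jako/."""
    if len(query) < 2 or not (query.startswith("/") and query.endswith("/")):
        raise ValueError("query must be enclosed in '/'")
    return query[1:-1]


def matching_words(query: str, text: str) -> list[str]:
    """List the whitespace-delimited words in which a match of the query starts."""
    pattern = re.compile(strip_delimiters(query))
    starts = [m.start() for m in pattern.finditer(text)]
    words = []
    for token in re.finditer(r"\S+", text):
        if any(token.start() <= s < token.end() for s in starts):
            words.append(token.group())
    return words


# Hypothetical sample text built from the words used on the slide.
SAMPLE = "jako niejako dwojako jakość norma. normalny"

print(matching_words(r"/jako/", SAMPLE))     # substring match anywhere
print(matching_words(r"/jako\s/", SAMPLE))   # "jako" followed by whitespace
print(matching_words(r"/\bjako/", SAMPLE))   # "jako" at the start of a word
print(matching_words(r"/norma\./", SAMPLE))  # "norma" followed by a literal dot
```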
Sources of morphological information
• Polish: IPI PAN corpus + …
• Ukrainian:
- grammatical dictionary by ULIF, UAS (Igor Shevchenko): lemma <> wordform
- morphological analyzer (its information is slightly different; built for homonymy disambiguation)
- no lemmatization (so far)
Types of tagsets
SYMBOLS: all possible grammatical characteristics of a wordform are encoded in one symbol (English (BNC), Ukrainian)
- takes little machine memory but requires too much human memory
CHAINS: contain codes corresponding to particular grammatical categories and/or their values; the morphological characteristics of a wordform are represented by a sequence of such codes
- can be even more economical than symbols if a query concerns morphological categories shared by several lexico-grammatical classes
• positional (Czech): every category (and its values) has a fixed position in the chain
• flexemic (Polish, Russian): every category has its own subtagset
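The difference between the two chain styles can be illustrated with a toy sketch. Everything in it is a hypothetical assumption made for illustration: the four-slot positional layout, the colon-separated codes and the three-word lexicon are not the actual Czech, Polish or PolUKR tagsets. It also shows why a query on a category shared by several classes (here: dative case) is cheap with chain tags.

```python
# Hypothetical positional layout: slot 0 = class, 1 = number, 2 = case, 3 = gender.
POSITIONS = ("class", "number", "case", "gender")


def decode_positional(tag: str) -> dict[str, str]:
    """Read each fixed slot; '-' marks a category that does not apply."""
    if len(tag) != len(POSITIONS):
        raise ValueError("positional tag must have exactly %d slots" % len(POSITIONS))
    return {name: value for name, value in zip(POSITIONS, tag) if value != "-"}


# Hypothetical value inventory: each category owns a disjoint set of codes,
# so a code identifies its category wherever it occurs in the chain.
VALUE_TO_CATEGORY = {
    "subst": "class", "adj": "class", "adv": "class",
    "sg": "number", "pl": "number",
    "nom": "case", "dat": "case",
    "m": "gender", "f": "gender", "n": "gender",
}


def decode_chain(tag: str) -> dict[str, str]:
    """Split a colon-separated chain and label every code with its category."""
    return {VALUE_TO_CATEGORY[code]: code for code in tag.split(":")}


# Toy lexicon with made-up tags for three Polish wordforms.
LEXICON = {"domowi": "subst:sg:dat:m", "nowemu": "adj:sg:dat:m", "szybko": "adv"}

# One condition covers nouns and adjectives alike.
dative_forms = [w for w, t in LEXICON.items() if decode_chain(t).get("case") == "dat"]
print(dative_forms)
print(decode_positional("NSDM"), decode_positional("R---"))
```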
Multext-East tagset for En Ro Sl Cz Bg Et Hu Hr Sr Re
• chain-like; criticised
• 14 PoS: N 10, V 15, A 12, P (pronoun) 17, D (determiner) 10, T (article) 6, R (adverb) 6, S (adposition) 4, C (conjunction) 7, M (numeral) 12, I (interjection) 2, X (residual), Y (abbreviation) 5, Q (particle) 3
• only Bg and Hu do not have modal verbs and copulas
• En Ro have determiners, Ro Hu Re have articles, Bg has neither (analytism, segmentation)
• Is a Bg noun formally indefinite if the article is attached to the adj? (cf. agglutinativity of Pl być)
• negation as a morphological category
• Cz transgressive (adverbial participle)
Treatment of participles
• Polish (no aspectual characteristics)
(Here and below cited from: Adam Przepiórkowski and Marcin Woliński, A Flexemic Tagset for Polish.)
• Ukrainian (aspect and tense)
Verb, adverbial participle, perfective aspect, past tense, active voice: VW прочитавши (having read)
Verb, adverbial participle, imperfective aspect, present tense, active voice: UQ читаючи (reading)
(Here and below cited from: Широков В.А et al. Корпусна лінгвістика.)
• PolUKR: participle I (doing/having done) is characterised by aspect
Treatment of pronouns
• notorious Slavonic pronoun problem: 296 unique tags for 309 pronouns
• Polish: division into 1st-2nd person, 3rd person and siebie (ów, jak?)
• Ukrainian: pro-noun, pro-adjective
• Russian: also pro-predicative and pro-adverb
• Czech: many subcategories on the level of SubPoS
• PolUKR: Ua approach and Pl division into 1st-2nd and 3rd person
Treatment of predicatives
• Polish: adverbs with modal semantics like można, trzeba '(it is) allowed/one can', '(it is) necessary', ?to
• Ukrainian (code X0): includes adverbs of state like жарко, шкода, жаль '(it is) hot', '(it is) a pity'
• PolUKR: the category is moved from the morphological level to the semantic one
http://www.ruscorpora.ru/search-main.html
Search engine for PolUKR
• choose the direction of the search (Ua>Pl or Pl>Ua)
• search conditions for both languages (RvonW)
• 3 levels of search:
– exact form
– (lemma) with the morphological choice
– using Poliqarp-like tag formulas (for advanced users)
• idea of subcategories (either a POS or a SUBPOS can be selected, but not both; similarly, one cannot select all subcategories of a POS), cf. aliases in IPI PAN corpus
• alternatives are provided through tick boxes, so that one can choose EITHER „VERB finite past” OR „NOUN dative neuter” OR something else, etc.
• restrictions on choice within 1 of 10 POS
VERB
– infinitive
– participle I
– non-finite form
– finite form
NOUN
– general
– proper name
– pro-noun 1-2 person
– pro-noun 3 person
ADJECTIVAL
– adjective, participle I and cardinal numeral
– pro-adjective
– indeclinable adjective
NUMERAL
– genderic
– non-genderic
ADVERB
PARTICLE
PREPOSITION
CONJUNCTION
INTERJECTION
aspect: perfective, imperfective
mood: imperative, indicative
person: first, second, third
case: nominative, genitive, dative, accusative, instrumental, locative, vocative
case: nominative, genitive, dative, accusative, instrumental, locative
gender: masculine, feminine, neuter, pluralia tantum
number: singular, plural
gender: masculine, feminine, neuter
number: singular, plural
case: nominative, genitive, dative, accusative, instrumental, locative
gender: masculine, feminine, neuter
tense: present, future, past
gender: masculine, feminine, neuter
number: singular, plural
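The tick-box alternatives of the search form amount to a disjunction of conjunctions: a token is a hit if it carries every feature of at least one ticked combination. A minimal sketch of that logic follows. The function name matches_any, the representation of features as plain strings and the three annotated tokens are assumptions made for illustration, not the PolUKR implementation.

```python
def matches_any(token_features: set[str], alternatives: list[set[str]]) -> bool:
    """True if the token carries every feature of at least one ticked alternative."""
    return any(alternative <= token_features for alternative in alternatives)


# EITHER "VERB finite past" OR "NOUN dative neuter", as in the slide's example.
ALTERNATIVES = [
    {"VERB", "finite form", "past"},
    {"NOUN", "dative", "neuter"},
]

# Hypothetical annotated tokens (Polish wordforms with hand-written features).
TOKENS = {
    "czytał": {"VERB", "finite form", "past", "masculine", "singular"},
    "oknu": {"NOUN", "general", "dative", "neuter", "singular"},
    "szybko": {"ADVERB"},
}

hits = [word for word, features in TOKENS.items() if matches_any(features, ALTERNATIVES)]
print(hits)
```

Keeping each alternative as a set makes the conjunction a simple subset test, and adding another ticked combination only appends one more set to the list.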
Built-in restrictions on search
A category can be selected (is active) only if the following category/value(s) have been selected by the user:
• mood OR tense OR person: finite form
• gender: finite form AND past tense OR adjective AND singular number OR pro-adjective AND singular number OR pro-noun 1-2 person AND singular number OR pro-noun 3 person AND singular number OR NUMERAL genderic AND singular number
• gender pluralia tantum: NOUN general OR NOUN proper name
• case vocative: NOUN general OR NOUN proper name
• none: indeclinable adjective OR ADVERB OR PARTICLE OR PREPOSITION OR CONJUNCTION OR INTERJECTION
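The restrictions above can be read as a small rule table. The sketch below encodes them as one function that decides whether a category is active for a given set of user selections. The function name is_active and the string labels are illustrative assumptions. Two further assumptions are marked in the comments: the "none" row is read as "no category at all can be selected", and categories not mentioned in the table are treated as unrestricted.

```python
# Selections that license gender when combined with singular number.
SINGULAR_HOSTS = {
    "adjective", "pro-adjective", "pro-noun 1-2 person",
    "pro-noun 3 person", "NUMERAL genderic",
}
GENERAL_OR_PROPER = {"NOUN general", "NOUN proper name"}
# Selections for which the table lists "none".
NO_CATEGORIES = {
    "indeclinable adjective", "ADVERB", "PARTICLE",
    "PREPOSITION", "CONJUNCTION", "INTERJECTION",
}


def is_active(category: str, selected: set[str]) -> bool:
    """Decide whether a category may be selected given the user's current choices."""
    # Assumption: "none" means that no category is available for these classes.
    if selected & NO_CATEGORIES:
        return False
    if category in {"mood", "tense", "person"}:
        return "finite form" in selected
    if category == "gender":
        verb_past = {"finite form", "past tense"} <= selected
        nominal_singular = "singular number" in selected and bool(selected & SINGULAR_HOSTS)
        return verb_past or nominal_singular
    if category in {"gender pluralia tantum", "case vocative"}:
        return bool(selected & GENERAL_OR_PROPER)
    # Assumption: categories not listed in the table carry no restriction.
    return True


print(is_active("gender", {"finite form", "past tense"}))
print(is_active("gender", {"adjective"}))
print(is_active("case vocative", {"NOUN proper name"}))
```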
Literature
• INTERA unified tagset project www.elda.org/intera
• Tomaž Erjavec et al. Multext-East specifications for Slavic languages, Budapest, 2003.
• Jan Hajič. Positional Tags: Quick Reference (Czech „HM” Morphology), 2000.
• Adam Przepiórkowski and Marcin Woliński. A Flexemic Tagset for Polish. In: The Proceedings of the Workshop on Morphological Processing of Slavic Languages, EACL 2003. http://nlp.ipipan.waw.pl/~adamp/Papers/2003-eacl-ws12/ws12.pdf
• Elena Paskaleva. Balkan South-East Corpora Aligned to English. In: The Proceedings of the Workshop on Common Natural Language Processing Paradigm for Balkan Languages, EACL 2003.
• Широков В.А et al. Корпусна лінгвістика. Київ: Довіра, 2005.