Part-of-Speech tagging of Northern Sotho:
Disambiguating polysemous function words
Gertrud Faaß
Ulrich Heid
Elsabé Taljard
DJ Prinsloo
This Talk
• Prologue
• Challenges for tagging Sotho texts
• Objectives
• Descriptive state of the art
for tagging of Sotho texts
– Tools
– Tagsets
• The ambiguity problem
• Methodology
• Results
• Conclusions & future work
Nine Official Bantu Languages of SA
• Sotho Group
– Northern Sotho / Sepedi
– Tswana
– Southern Sotho
• Nguni Group
– Zulu
– Swati
– Xhosa
– Ndebele
• Venda and Tsonga (not part of either group)
Noun class system
Cl.No      CP           Example
1          mo-          mosadi ‘woman’
2          ba-          basadi ‘women’
1a         Ø-           malome ‘uncle’
2b         bo-          bomalome ‘uncle & co’
3          mo-          monwana ‘finger’
4          me-          menwana ‘fingers’
5          le-          lebone ‘light’
6          ma-          mabone ‘lights’
7          se-          selepe ‘axe’
8          di-          dilepe ‘axes’
9          N- / Ø-      mpša ‘dog’ / hlogo ‘head’
10         diN- / di-   dimpša ‘dogs’ / dihlogo ‘heads’
14         bo-          bodulo ‘residence’
(6)        ma-          madulo ‘residences’
15         go-          go ruta ‘to learn’
16         fa-          fase ‘below’
17         go-          godimo ‘above’
18         mo-          morago ‘behind’
N-         N- / Ø-      ntle ‘outside’ / pele ‘in front’
(24) ga-   ga-          gare ‘middle’
Concordial agreement – Northern
Sotho
Taljard and Bosch (2005)
Challenges for tagging
• Ambiguity, for example:
– function words:
-a- is 9-ways ambiguous,
-go- is up to 30 (11, 6, 5, …)-ways ambiguous
• Unknown words (nouns and verbs)
– noun derivation:
toropo (town) -> toropong (in/at/to town)
– verb derivation: next slides
Challenges: unknown words
• Agglutinating languages:
extensive use of affixes
– Example: rekišeditšwe ‘was / were sold for’
< rek- ‘buy’ (verb root) + -iš- (causative) + -el- (applied) + -il- (past tense) + -w- (passive) + -e (inflectional ending)
Examples of suffixes and combinations for a single verb
• ROOTetšane, ROOTetšanwa, ROOTetšanwe, ROOTiša, ROOTišitše,
ROOTišwa, ROOTišitšwe, ROOTišana, ROOTišane, ROOTišanwa,
ROOTišanwe, ROOTišega, ROOTišegile, ROOTišetša, ROOTišeditše,
ROOTišetšwa, ROOTišeditšwe, ROOTišetšana, ROOTišetšane,
ROOTišetšanwa, ROOTišetšanwe, ROOTišiša, ROOTišišitše, ROOTišišwa,
ROOTišišitšwe, ROOTišišana, ROOTišišane, ROOTišišanwa, ROOTišišanwe,
ROOToga, ROOTogile, ROOTogwa, ROOTogilwe, ROOTogana, ROOTogane,
ROOToganwa, ROOToganwe, ROOTogela, ROOTogetše, ROOTogelwa,
ROOTogetšwe, ROOTola, ROOTotše, ROOTolwa, ROOTotšwe, ROOTolana,
ROOTolane, ROOTolanwa, ROOTolanwe, ROOTolega, ROOTolegile,
ROOTolela, ROOToletše, ROOTolelwa, ROOToletšwe, ROOTolelana,
ROOTolelane, ROOTolelanwa, ROOTolelanwe, ROOTolla, ROOTolotše,
ROOTollwa, ROOTolotšwe, ROOTollana, ROOTollane, ROOTollanwa,
ROOTollanwe, ROOTollega, ROOTollegile, ROOTollela, ROOTolletše,
ROOTollelwa, ROOTolletšwe, ROOTollelana, ROOTollelane, ROOTollelanwa,
ROOTollelanwe, ROOTolliša, ROOTollišitše, ROOTollišwa, ROOTollišitšwe,
ROOTollišana, ROOTollišane, ROOTollišanwa, ROOTollišanwe, ROOTologa,
ROOTologile, ROOTologana, ROOTologane, ROOTologanwa,
ROOTologanwe, ROOTološa, ROOTološitše, ROOTološwa, ROOTološitšwe,
ROOTološana, ROOTološane, ROOTološanwa, ROOTološanwe,
ROOTološetša, ROOTološeditše, ROOTološetšwa, ROOTološeditšwe,
ROOTološetšana, ROOTološetšane, ROOTološetšanwa, ROOTološetšanwe,
ROOToša, ROOTošitše, ROOTošwa, ROOTošitšwe,
ROOTošetša, ROOTošeditše, ROOTošetšwa, ROOTošeditšwe,
ROOTošetšana, ROOTošetšane, ROOTošetšanwa, ROOTošetšanwe
Solution
for unknown verbs and nouns
• Verb guesser: detection of
– longest match suffix combinations
– occurrences in corpora
• Noun guesser: matching of
– singular/plural-forms
– nominal suffixes
– occurrences in corpora
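The verb guesser idea can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name `guess_verb`, the short suffix list (a few entries copied from the suffix slide), and the corpus check (the root plus final -a must be attested) are all assumptions made for the example.

```python
from collections import Counter

# A few suffix combinations copied from the slide listing; a real guesser
# would use the full inventory. The selection here is illustrative only.
VERB_SUFFIXES = ["iša", "išitše", "išwa", "išetša", "išeditše", "išeditšwe", "oga", "ologa"]

def guess_verb(token, corpus_counts, suffixes=VERB_SUFFIXES, min_root_len=2):
    """Return (root, suffix) for the longest matching suffix, or None.

    Assumption: a candidate root is accepted only if its basic form
    (root + 'a') occurs in the corpus counts.
    """
    # Try longer suffix combinations first so the longest match wins.
    for suffix in sorted(suffixes, key=len, reverse=True):
        if token.endswith(suffix):
            root = token[: -len(suffix)]
            if len(root) >= min_root_len and corpus_counts[root + "a"] > 0:
                return root, suffix
    return None

# Hypothetical corpus counts, used only to exercise the function.
counts = Counter({"reka": 3})
result = guess_verb("rekišeditšwe", counts)
```

With these toy counts, `rekišeditšwe` is split into the root `rek` and the suffix combination `išeditšwe`, while a noun such as `toropo` matches no verbal suffix and is left to the noun guesser.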
Objectives
• Tagging with a detailed tagset: class numbers
– Nouns, adjectives, pronouns, concords,
demonstratives
• Disambiguation
• Motivation: tagging used as preprocessing for:
– Chunking, parsing
– Lexicography (tag relatively large corpora, e.g. PSC)
– Detailed linguistic research
(e.g. grammar development)
– Information extraction
State of the art for tagging:
Sotho languages
• Comparison of tagsets and tools
is hardly possible
– Different applications of tagged material
(linguistic description, lexicography, parsing, etc.)
– Different number of tags
– Differences in granularity
Descriptive State of the Art:
tagsets and tools
Authors                          No. of tags   Noun class (yes/no)   Tool?
Van Rooy and Pretorius (2003)    106           no                    no
De Schryver and De Pauw (2007)   56            no                    yes
Kotzé (several, e.g. 2008)       partial       no                    yes
Taljard et al. (2008)            141/262       yes                   no
This paper                       25/141        yes                   yes
Descriptive State of the Art for
tagging: Sotho languages
Tools:
• Full
– De Schryver and De Pauw (2007)
Northern Sotho tagger
(statistical)
• Partial
– Kotzé (several publications, e.g. 2008)
Verbal and nominal segment
(finite state)
Descriptive state of the art for
tagging: Sotho languages
Applications of tagsets:
• De Schryver and De Pauw (2007):
used for lexicography
• Van Rooy and Pretorius (2003):
linguistic description of Setswana
• Taljard et al. (2008):
morphosyntactic and general linguistic
description
The ambiguity problem
• -a-, -go-: see handout for possible
readings
• Local context may not identify noun class
of subject concord:
(Masogana) …   A      nwa     bjalwa
               CS06   drink   beer
(Young men) … “They drink beer.”
The ambiguity problem:
possible solutions
– Dependent on objectives
• Flat tagset ignoring irrelevant details
(cf. handout for -go-)
• Layered tagset: granularity
Tagset (cf. Handout)
• Level 1
– Noun = (N)
– Subject concord (CS), Object concord (CO)
– Pronouns (PRO)
• Level 2
– emphatic (only for pronouns) EMP
– possessive (only for pronouns) POSS
• Level 3
– Classes -> N.01a, N.01, N.02, N.03, … , PERS, etc.
• Example:
noun of class 1 = N.01
possessive pronoun of class 6 = PRO.POSS.06
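To make the layering concrete, a tag can be split back into its levels. The sketch below assumes the dot-separated format shown in the examples; the function names `parse_tag` and `coarsen` and the dictionary keys are hypothetical and not part of the tagset definition.

```python
# Level-2 labels named on the slide; both apply to pronouns only.
SUBTYPES = {"EMP", "POSS"}

def parse_tag(tag):
    """Split a layered tag such as 'PRO.POSS.06' into its levels."""
    parts = tag.split(".")
    category = parts[0]          # level 1, e.g. N, CS, CO, PRO
    rest = parts[1:]
    # Level 2 is optional and only present for some pronoun tags.
    subtype = rest.pop(0) if rest and rest[0] in SUBTYPES else None
    # Level 3 is the class label (01a, 01, 02, ..., PERS), if any.
    class_label = rest[0] if rest else None
    return {"category": category, "subtype": subtype, "class_label": class_label}

def coarsen(tag):
    """Project a detailed tag onto level 1 only."""
    return parse_tag(tag)["category"]

noun = parse_tag("N.01")
pronoun = parse_tag("PRO.POSS.06")
```

Projecting tags onto level 1 in this way is one option for obtaining a flat tagset from annotations made with the detailed one.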
RF tagger technology
(cf. Schmid and Laws 2008)
• Hidden Markov Model (HMM) tagger
• Additional external lexicon
• Large, fine-grained tagsets
• Several levels of description,
e.g. German articles:
ART.Definiteness.Case.Number.Gender
• Calculates joint (product) probabilities
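The product idea can be illustrated with a small sketch: the probability of a layered tag is taken as the product of one conditional probability per level. How each factor is estimated is left out here, the function name is hypothetical, and the three numbers are invented purely for illustration, not measured values.

```python
import math

def joint_tag_probability(level_probs):
    """Multiply the per-level conditional probabilities of one layered tag."""
    return math.prod(level_probs)

# Invented toy values for the tag PRO.POSS.06 in some context:
#   P(PRO | context), P(POSS | PRO, context), P(06 | PRO, POSS, context)
toy_factors = [0.5, 0.5, 0.25]
p_tag = joint_tag_probability(toy_factors)
```

Splitting the estimate by level means each factor is estimated over fewer alternatives than the full tagset, which is one reason this kind of model suits large, fine-grained tagsets.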
Training corpus
• 45,000 tokens
(manually annotated word forms)
from two text types
• Not balanced
(25,000 tokens from a novel,
2 × 10,000 tokens from dissertations)
Comparing taggers on manually
annotated data
• Tree-Tagger (Schmid 1994)
• TnT Tagger (Brants 2000)
• MBT Tagger (Daelemans et al. 2007)
• RF-Tagger (Schmid and Laws 2008)
Effects of size of training corpus
• Adding more training data is no longer necessary
Effects of highly polysemous
function words
• Distribution problem
• Probability guesses for scarce labels
become unreliable
– -a-:
» PART (45) vs. CS.01 (1,182)
» 91% incorrect labeling of PART
• Detailed discussion:
see handout, pages 2 and 4 (-a-)
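The skew behind the distribution problem can be made explicit with a little arithmetic on the two counts given for -a-. The sketch below only covers these two readings (the other readings of -a- are not listed on the slide), and the helper name `label_share` is hypothetical.

```python
def label_share(counts):
    """Return each label's share of the total number of occurrences."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

# Counts for two readings of -a- as given on the slide.
counts_a = {"PART": 45, "CS.01": 1182}
shares = label_share(counts_a)
part_percent = round(shares["PART"] * 100, 1)
```

Among these two readings alone, PART accounts for 45 of 1,227 occurrences, so a statistical tagger sees very little evidence for it compared with CS.01.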
Alternative proposal: hybrid taggers
(Spoustová et al. 2007)
• Combine
rule-based tagging with statistical tagging
• For Northern Sotho:
– Contextual disambiguation works fine
with RF-tagger
if unambiguous indicators are available
– Disambiguating macros (using the same indicators)
hence have little effect
– Ambiguous contexts are hard to account for either way:
need for parsing?
Results: 10-fold cross validation
• Without guessers
(to simulate similar conditions for TnT and MBT)
– RF-tagger: 91.00%
– TnT tagger: 91.01%
– MBT: 87.68%
• With guessers
(several thousand nouns and verbs part of the lexicon)
– Tree-tagger: 92.46%
– RF-tagger: 94.16%
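For readers unfamiliar with the evaluation setup, 10-fold cross validation can be sketched as below. How the folds were actually built for these experiments is not stated on the slide; the round-robin split and the helper names `k_fold_splits` and `accuracy` are assumptions for illustration.

```python
def k_fold_splits(items, k=10):
    """Yield (train, test) pairs; each item is in the test part exactly once."""
    folds = [items[i::k] for i in range(k)]  # round-robin assignment to folds
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test

def accuracy(gold, predicted):
    """Share of tokens whose predicted tag equals the gold tag."""
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)

# Toy usage: 20 dummy "sentences" split into 10 folds.
sentences = list(range(20))
splits = list(k_fold_splits(sentences))
```

In such a setup a tagger is trained on each `train` part and evaluated on the matching `test` part, and the reported figure is typically the mean accuracy over the ten runs.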
Conclusions
• Different intended uses lead to different tagsets
(granularity, number of tags)
• Including noun class information is essential for general
linguistic research, e.g. grammar development,
applications of chunking/parsing
• RF-Tagger performs well for our layered tagset with the
existing amount of training data (45,000 tokens): over 94%
correct
• Ambiguous contexts and sparse data problem combined
lead to a high error rate for statistical parsing - not likely
to be solvable with macros
– Chunking / Parsing might lead to a more adequate solution for
this problem
Future work
• Apply RF-tagger to the PSC corpus
• Evaluate results
• Instead of preprocessing rules, a partial
postprocessing may make sense (e.g.
chunking, parsing)