Cross language information retrieval in South African

CLIR: opening up
possibilities for indigenous
languages in South Africa?
Research team: Erica Cosijn (1), Heikki Keskustalo (2),
Ari Pirkola (2), Karen de Wet (1) & Kalervo Järvelin (2)
(1) University of Pretoria, Pretoria, South Africa
(2) University of Tampere, Finland
1
Introduction
• What is CLIR?
• General methodology
• Afrikaans-English CLIR
• Zulu-English CLIR
• The road ahead
• Conclusions
2
What is CLIR?
• The basic idea is to bridge the language boundary by
providing access in one language (the source
language) to documents written in another
language (the target language)
• Source language: the language that gives access
to the required information, i.e. the query
language
• Target language: the language of the content in
the database
3
CLIR (cont.)
• CLIR is implemented through:
– query translation (from the source language)
and/or document translation
• Main strategies for query translation
– dictionary-based methods
– corpus-based methods, and
– machine translation
4
CLIR approaches
• Corpus-based methods: work with
frequency analysis
– Implication: aboutness of the two collections
should be similar
• Machine translation: uses morphological
parser etc.
5
CLIR: Machine translation
• Translates source language texts into target
language using:
– Translation dictionaries
– Other linguistic resources
– Syntax analysis
• Limited availability
6
CLIR: Dictionary Based
• Problems
– Limitations of dictionaries
– Inflected word forms
– Phrases and compound words
– Lexical ambiguity
• Possible solution
– Approximate string matching
7
[Figure: dictionary-based CLIR process. Source language query →
dictionary translation (using a bilingual source-English dictionary
and other linguistic resources) → English language query →
retrieval in English language database → English result]
8
CLEF
The Cross-Language Evaluation Forum
supports global digital library applications by
(i) developing an infrastructure for the
testing, tuning and evaluation of information
retrieval systems operating on European
languages in both monolingual and cross-language
contexts, and (ii) creating test-suites of
reusable data which can be employed by
system developers for benchmarking
purposes
9
Retrieval system and test data
• Inquery – commercially available
• Probabilistic – i.e. best match, not exact
• “Bag of words” or structured queries
• Used by Finnish partners in their projects
• TEST DATA: CLEF 2001
– 112 000 newspaper articles
– 35 queries (title and description)
– English to English baseline for comparison
– 2 sets
• Afrikaans/Zulu title
• Afrikaans/Zulu title and description
10
Afrikaans-English CLIR
• Afrikaans spoken by third largest group in
South Africa as first language
• Originated mainly from Dutch
• Germanic language
• Not inflectional
• Good technical vocabulary
• Good resources – e.g. dictionaries, spell
checkers, parsers, compound splitters.
11
Methodology : Resources
• Electronic bilingual dictionary
– Filtered commercial dictionary
• Stopword list
– Translated from English and adapted
• Morphological analyzer
– Derived statistically from analysis of large
newspaper text body
12
Dictionary Filtering
• Headwords identified by string-based rules
• Alternative spellings separated and listed as
separate headwords
• Homonyms: each sense listed as separate
headword
• Compounds identified and listed as separate
headwords
• Plurals not included, but solved by morph analyzer
• Manual checking and fine-tuning
13
Stopword list
• Translation of existing English stopword list
• Check homonyms, e.g. again = weer = weather
• Large text body – Afrikaans language newspaper
articles – 3500 words
• Frequency analysis compared to translated list
• Ad hoc additions
• Accented words added
• N=341
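The frequency-analysis step above can be illustrated with a minimal sketch. It assumes the corpus is available as a plain string and the translated stopword list as a Python set; the function name `stopword_candidates` and the `top_n` cut-off are hypothetical and not taken from the project.

```python
import re
from collections import Counter

def stopword_candidates(text, translated_stopwords, top_n=50):
    """List frequent corpus words that are missing from the translated stopword list."""
    # Tokenise on runs of letters; accented letters such as "ë" are kept.
    words = re.findall(r"[^\W\d_]+", text.lower())
    counts = Counter(words)
    frequent = [word for word, _ in counts.most_common(top_n)]
    # Frequent words not yet on the list are candidates for ad hoc addition.
    return [word for word in frequent if word not in translated_stopwords]

sample = "die kat en die hond en die voël sit in die huis"
print(stopword_candidates(sample, {"die", "in"}, top_n=2))
```

The candidates would still need manual checking, as the homonym example on this slide shows.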
14
Morphological analyser (1)
• Based on patterns in language
• Newspaper text used for manual analysis
• 3500 words sorted by frequency facilitated
duplicate removal
• 1200 unique words
15
Morphological analyser: Plurals
• All plural forms manually identified from 1200
words
• 62% of Afrikaans plurals formed by adding -e, -s
or -’s to singular
• 13% of plurals have a double vowel in singular
and plural is formed by removing one vowel and
adding an -e to the end of the word
• Thus 75% of plurals solved by two simple rules
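As a rough illustration of the two plural rules, the sketch below generates singular candidates and accepts the first one found among the dictionary headwords. The function names and the small headword set are assumptions made for this example; the project's own implementation is not shown in the slides.

```python
VOWELS = "aeiou"
DOUBLING_VOWELS = "aeou"  # singulars with aa, ee, oo, uu

def singular_candidates(word):
    """Generate possible singular forms using the two simple plural rules."""
    candidates = []
    # Rule 1: plural formed by adding -e, -s or -'s to the singular.
    for suffix in ("'s", "’s", "s", "e"):
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            candidates.append(word[:-len(suffix)])
    # Rule 2: the singular has a double vowel; the plural drops one vowel
    # and adds -e (e.g. boom -> bome), so reverse that here.
    if (len(word) >= 4 and word.endswith("e")
            and word[-2] not in VOWELS
            and word[-3] in DOUBLING_VOWELS
            and word[-4] not in VOWELS):
        candidates.append(word[:-3] + word[-3] * 2 + word[-2])
    return candidates

def to_singular(word, headwords):
    """Return the first candidate that is a dictionary headword, else None."""
    for candidate in singular_candidates(word):
        if candidate in headwords:
            return candidate
    return None

headwords = {"boom", "boek", "motor", "foto"}
for plural in ("bome", "boeke", "motors", "foto's"):
    print(plural, "->", to_singular(plural, headwords))
```

Checking candidates against the headword list keeps the rules simple: a wrong candidate such as "bom" for "bome" is discarded unless it happens to be a headword itself.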
16
Morphological analyser: Affixes
Manual analysis of text shows
• Past tense indicated by ge- prefix, but
sometimes embedded, e.g. aangesteek
• Various suffixes are common: -te, -ste, -er,
-ing, -ke, -le, -de, etc.
• Suffix stripping possible by longest
common substring (LCS) matching
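A minimal sketch of suffix stripping by longest common substring follows. It uses Python's standard `difflib` module and picks the headword sharing the longest contiguous string with the query key; the helper names, the `min_len` threshold and the alphabetical tie-breaking are assumptions for illustration, not the project's actual procedure.

```python
from difflib import SequenceMatcher

def longest_common_substring(a, b):
    """Return the longest contiguous string shared by a and b."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    return a[match.a:match.a + match.size]

def best_headword(word, headwords, min_len=3):
    """Pick the headword with the longest common substring (at least min_len long)."""
    best, best_len = None, 0
    for headword in sorted(headwords):  # sorted so that ties break alphabetically
        common = longest_common_substring(word, headword)
        if len(common) >= min_len and len(common) > best_len:
            best, best_len = headword, len(common)
    return best

# "vinniger" (faster) is matched to the headword "vinnig" (fast).
print(best_headword("vinniger", {"vinnig", "vind", "rivier"}))
```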
17
Morphological analyser:
Compounds
Manual analysis of text shows
• Relatively high occurrence of compounds in
Afrikaans - 1%
• Different types of compounds
• With or without fogemorphemes (joining
morphemes)
• Only two fogemorphemes identified,
namely -s- and -e-
18
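Compound splitting with the two joining morphemes can be sketched as follows. Note that this example uses plain dictionary lookup on the compound's parts rather than the LCS-based runs mentioned elsewhere in the slides; the function name, the `min_part` threshold and the headword set are hypothetical.

```python
FOGEMORPHEMES = ("", "s", "e")  # no joint, -s-, -e-

def split_compound(word, headwords, min_part=3):
    """Split a compound into headwords, allowing an optional -s- or -e- joint.

    Returns a list of parts, or None if no complete split is found.
    """
    if len(word) >= min_part and word in headwords:
        return [word]
    # Try the longest possible first part first.
    for i in range(len(word) - min_part, min_part - 1, -1):
        head = word[:i]
        if head not in headwords:
            continue
        rest = word[i:]
        for joint in FOGEMORPHEMES:
            if rest.startswith(joint):
                tail = split_compound(rest[len(joint):], headwords, min_part)
                if tail:
                    return [head] + tail
    return None

headwords = {"baba", "kos", "regering", "beleid"}
print(split_compound("babakos", headwords))
print(split_compound("regeringsbeleid", headwords))
```

Each part can then be translated separately, which is how a compound absent from the dictionary can still contribute query terms.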
Morphological analyser test data:
Statistics - solvable
No.  Category                                                          N     %
1    Stopwords                                                         150   14,0
2    Headwords                                                         565   52,7
3    oo, aa, ee, uu rule solvable                                      18    1,7
4    e, s, ’s rule (OR Longest Common Substring)                       85    7,9
5    More LCS matching                                                 59    5,5
6    Stripping prefix, e.g. ge-                                        13    1,2
7    Compound splitting (multiple LCS runs + fogemorpheme stripping)   50    4,7
     Total                                                             940   87,7
19
Morphological analyser test data:
Statistics – not solvable
No.  Category                                  N    %
8    Compounds incorrectly solved              8    0,7%
9    Past tense -ge- embedded in word          26   2,4%
10   Not solvable by morphological analyser    16   1,5%
11   Misspelt in original text                 2    0,2%
     Total                                     52   4,8%
12   Proper nouns                              80   7,5%
20
[Flow chart: processing of an original Afrikaans query key. Boxes:
– Preprocess key (verify character set used: preserve both uppercase
and lowercase letters)
– Is the key found as-is (i.e. as a translation dictionary entry)?
– Does the key start with an uppercase letter? If so, modify uppercase
to lowercase
– Is the key found after removal of the ge- prefix? If so, remove the
prefix from the word
– Is the key recognized as the plural of a “double vowel singular”
case? If so, normalize the word to singular form
– Is the key a compound (i.e. decomposable using the LCS method)? If
so, decompose the word utilizing fogemorphemes
– Is the word (or decomposed part) a stop word? If so, remove;
otherwise translate using the Afr-Eng dictionary, giving word (or
component) translations in English
– Modify lowercase to uppercase: is the uppercase form found as-is in
the dictionary?
– Unrecognized Afrikaans key: is the key a stopword? If so, remove;
otherwise fuzzy matching (target index), giving the most similar
words from the English database]
21
Morphological analyser – steps
(condensed from flow chart)
• Match words found in dictionary
• Uppercase becomes lower case
• Remove ge- prefix
• Double vowel plural case
• Match longest common substring (suffixes as
well as compounds solved)
• Modify lower case to uppercase (probably proper
noun)
• Fuzzy match “as is” with target language database
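The ordering of these steps can be sketched as a simple cascade. This is a simplified illustration under several assumptions: the dictionary is a Python dict of headwords, longest-prefix lookup stands in for LCS matching, and stopword removal and compound splitting are left out. The names `resolve_key` and `double_vowel_singular` are hypothetical.

```python
def double_vowel_singular(word):
    """Reverse the double vowel plural rule, e.g. bome -> boom."""
    if (len(word) >= 4 and word.endswith("e") and word[-2] not in "aeiou"
            and word[-3] in "aeou" and word[-4] not in "aeiou"):
        return word[:-3] + word[-3] * 2 + word[-2]
    return None

def resolve_key(key, dictionary):
    """Return (step name, headword) for the first step that finds the key."""
    if key in dictionary:
        return "as-is", key
    lower = key.lower()
    if lower in dictionary:
        return "lowercase", lower
    if lower.startswith("ge") and lower[2:] in dictionary:
        return "ge-prefix", lower[2:]
    singular = double_vowel_singular(lower)
    if singular in dictionary:
        return "double-vowel plural", singular
    # Simplified stand-in for LCS matching: longest headword that is a prefix.
    for end in range(len(lower) - 1, 2, -1):
        if lower[:end] in dictionary:
            return "suffix stripped", lower[:end]
    capitalised = lower.capitalize()
    if capitalised in dictionary:
        return "uppercase", capitalised
    # Nothing matched: hand the key over to fuzzy matching in the target index.
    return "fuzzy", key

dictionary = {"boom": ["tree"], "werk": ["work"], "vinnig": ["fast"],
              "Pretoria": ["Pretoria"]}
for key in ("Boom", "gewerk", "bome", "vinniger", "pretoria", "xyzzy"):
    print(key, "->", resolve_key(key, dictionary))
```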
22
Example
Database used: CLEF
English title: Pesticides in Baby Food
Afrikaans source query: Plaagdoders in babakos
English baseline query: #sum(pesticide baby food)
The English target query translated from the
Afrikaans source query: #sum(#syn(nullstr lues
die van plague plague blight infestation pest
affliction vexation killer) #syn( nullstr) #syn(
baby food))
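The structure of the translated query can be reproduced with a small string-building sketch: each source key becomes one #syn group holding all of its dictionary translations, and the groups are combined with #sum. Only the operator syntax visible in the example above is assumed; the function name and the shortened translation lists are illustrative.

```python
def build_structured_query(translations_per_key):
    """Wrap each key's translations in #syn(...) and combine the groups with #sum(...)."""
    groups = ["#syn(" + " ".join(translations) + ")"
              for translations in translations_per_key
              if translations]  # keys without any translation are skipped
    return "#sum(" + " ".join(groups) + ")"

# One list of English translations per Afrikaans source key.
print(build_structured_query([["plague", "pest", "killer"], ["baby", "food"]]))
# → #sum(#syn(plague pest killer) #syn(baby food))
```

Grouping the alternatives of one key as synonyms keeps a key with many translations from dominating the query, which matters when the dictionary returns many senses.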
23
Results
24
Conclusions
• Dictionary probably too large
• Normalizer worked quite well
• Compound splitting by LCS methods
mostly successful
• Stopword list adequate
• Results quite promising
25
Zulu-English CLIR
• isiZulu spoken by 8,8 million – largest number of
speakers for a single language in SA
• Agglutinative – grammatical information
conveyed by attaching pre- and suffixes to roots
and stems
• Nouns: Grammatical genders – 8 classes in Zulu
with distinctive prefixes in every class for singular
and plural forms
• Verbs: Affixes mark grammatical relations such as
object, subject, tense, mood, aspect
26
Methodology: Zulu to English
[Figure: Zulu-English CLIR process. Zulu source query →
approximate dictionary matching (against the monolingual Zulu
dictionary) → Zulu base form query → dictionary translation
(Zulu-English dictionary) → English query → retrieval in the
CLEF English database → English result]
27
Methodology (1)
• Monolingual word list
– No electronic bilingual dictionary
• Approximate matching
– Of all five metric and non-metric similarity
measures tested, skipgrams yielded best results
– The Zulu word could be identified within three
words 80% of the time
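To give an idea of how skipgram matching against a word list can work, here is a minimal sketch. It builds character pairs with zero or one skipped character and ranks word-list entries by set overlap; the exact skipgram classes and similarity formula used in the project are not given in the slides, so the `max_skip` value, the Jaccard-style score and the function names are assumptions.

```python
def skipgrams(word, max_skip=1):
    """Character pairs with 0..max_skip characters skipped between them."""
    grams = set()
    for skip in range(max_skip + 1):
        for i in range(len(word) - skip - 1):
            grams.add(word[i] + word[i + skip + 1])
    return grams

def similarity(a, b):
    """Shared skipgrams divided by all skipgrams of the two words."""
    grams_a, grams_b = skipgrams(a), skipgrams(b)
    union = grams_a | grams_b
    if not union:
        return 0.0
    return len(grams_a & grams_b) / len(union)

def best_matches(query_word, word_list, top=3):
    """Return the top most similar entries of the monolingual word list."""
    ranked = sorted(word_list, key=lambda w: similarity(query_word, w), reverse=True)
    return ranked[:top]

# An inflected form is matched against a (tiny, illustrative) Zulu word list.
print(best_matches("izizwe", ["isizwe", "izindlu", "imibhalo"]))
```

Returning the top three candidates mirrors the observation above that the correct word was usually found within the first three matches.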
28
Methodology (2)
• Translations from Zulu source words into English
done manually
• Problems experienced in this process
– Paraphrasing due to disparate vocabularies
E.g. isinyabulala – person weak from age
– Homonyms – single words with various meanings
E.g. –zwe isizwe izizwe = tribe OR rapidly spreading
brain disease
29
Example of paraphrasing
Find documents that describe acts of terrorism or
vandalism against European synagogues since the
end of the Second World War.
Zulu                  English gloss
Thola                 Find
imibhalo              scriptures
echaza                that describe
izenzo                acts
zokuphekulazikhuni    of terror
noma                  and
izinto                the breaking of things
nezindlu              the houses
ngobudlova            with violent force
elwa                  that fight
zesonto               of Sunday
zamaJuda              of the Jews
ase-Europe            of Europe
kusukela              from
ekupheleni            the end
kwezimpi              of the war
zesibili              of second
zomhlaba              of the world
30
Analysis of translation problems
[Bar chart: number of occurrences per error type]
Error type            Number of occurrences
interrogative         2
enclitic              2
verb extensions       12
conjunctives          6
locatives             9
homonyms              55
vowel elision         5
vowel coalescence     17
pre-nasalisation      5
palatalisations       10
paraphrases           46
proper names          41
zululisations         20
borrowed words        28
31
The road forward
• Parsers and morphological analysers in
progress
• Spellcheckers have extensive word lists
• Increasing web presence of indigenous
languages, especially government sites and
newspapers, leads to the possibility of parallel
corpora
• Cross Cultural Information Retrieval?
32
Conclusions
• Indigenous Knowledge is a valuable
resource – it is important to make it
accessible
• Learn from international research and create
a good product from the outset
• Many opportunities for research
33
Cross Language Information
Retrieval (CLIR)
• To provide access in one language to
documents written in another language
• Query translation or document translation
• Approaches
– Corpus-based techniques
– Machine translation
– Dictionary-based techniques
34