Buckwalter Arabic Morphological Analyzer Enhancer (BAMAE)


Bibalex Arabic Morphological Enhancer
BAMAE: Buckwalter Arabic Morphological Analyzer Enhancer
4th International Conference on Arabic Language Processing
(CITALA’12) 2-3 May 2012
Rabat, Morocco
Sameh Alansary
Alexandria University
Bibliotheca Alexandrina
[email protected]
Overview
Introduction.
Issues on BAMA’s output quality.
Preparing data sets (corpus).
Building BAMAE.
Evaluating BAMAE.
Conclusion
INTRODUCTION
Previous Arabic analyzed corpora
The Penn Arabic Treebank.
- Uses Buckwalter Arabic Morphological Analyzer (BAMA).
- Morphological and syntactic analysis.
- 1 million words annotated.
Prague Arabic Dependency Treebank.
- Adopts Functional Generative Description of language.
- Multi-level linguistic annotations.
- The morphological level is based on Penn Treebank’s model.
- 100,000 words.
CLARA (Corpus Linguae Arabicae).
- Building a balanced and annotated corpus.
- POS tag set based on EAGLES recommendations.
- 15,000 words analyzed.
Most previous Arabic analyzed corpora used BAMA, as it has been found to be the most suitable lexical resource.
International Corpus of Arabic
(ICA)
A representative corpus drawn from written Modern Standard Arabic (MSA) sources.
Planned to include 100 million words.
Covers different sources and genres.
Markup codes have been added.
The concatenative approach was adopted in analyzing the ICA.
BAMA was chosen to analyze the ICA.
ISSUES ON BAMA’S OUTPUT QUALITY
Dealing with Arabic words as their English counterparts.
Missing information.
Wrong information.
Wrong concatenations and segmentations.
Dealing with Arabic words as their English counterparts
- BAMA classifies some adverbs, such as بين and إذا, as prepositions or subordinating conjunctions. These words were modified to be adverbs of time or place: (ADV_T) or (ADV_P).
- BAMA classifies some verbs as adverbs. These words were modified to be verbs.
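The adverb retagging described on this slide could be sketched as a small override table. This is only an illustration: the tag names `PREP` and `SUB_CONJ`, the function name, and the choice of which word becomes ADV_P and which ADV_T are assumptions, not details given in the slides.

```python
# Hypothetical override table: surface word -> corrected adverb tag.
# Which word maps to ADV_P and which to ADV_T is an assumption made here.
ADVERB_OVERRIDES = {
    "بين": "ADV_P",  # assumed: adverb of place
    "إذا": "ADV_T",  # assumed: adverb of time
}

# Tags under which such adverbs are reported to be misclassified (names assumed).
MISLABELED_TAGS = {"PREP", "SUB_CONJ"}


def correct_adverb_tag(word: str, tag: str) -> str:
    """Return the corrected tag, or the original tag when no override applies."""
    if tag in MISLABELED_TAGS and word in ADVERB_OVERRIDES:
        return ADVERB_OVERRIDES[word]
    return tag
```

Words that are not in the table keep whatever tag the analyzer produced, so the correction stays limited to the listed cases.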
Missing information
- BAMA does not assign number, gender or definiteness in some cases. Some of these words have been dealt with manually and others have been fixed automatically.
- BAMA does not provide some passive forms of past and present verbs.
- BAMA does not provide most imperative forms of verbs. Such passive and imperative forms have been analyzed manually.
- BAMA does not assign a type of adverb (time or place). Such words were modified to be (ADV_T) or (ADV_P).
- BAMA does not provide all the possible solutions or tags. For example, where only ADJ is given and NOUN is also needed, the tag NOUN has been added manually.
- BAMA cannot cover all possible glosses. More glosses have been added, e.g. "pregnancy" and "motivation/motivating".
Wrong information
- BAMA wrongly predicts gender, number and definiteness in some cases. Some of these words have been dealt with manually and others have been fixed automatically.
- BAMA classifies some adverbs, such as بين, as inflectional nouns. Only the accusative case is correct for them.
- BAMA does not detect lemmas correctly in some cases. Such lemmas have been fixed manually.
Wrong concatenations and segmentations
- It concatenates prefix, stem and suffix wrongly.
  EX: “بأكبر”: ب + أكبر ADJ (given by BAMA); ب + أكبر NOUN (not available).
  EX: “زاهدين”: زاهد + ين dual suffix (given by BAMA); زاهد + ين plural suffix (not available).
- It sometimes fails to segment words correctly.
  EX: “اترك”: Stem = “اترك” by BAMA. Correct: Prefix = “ا” + Stem = “ترك”.
  EX: “آمر”: Prefix = “آ” + Stem = “مر” by BAMA. Correct: Prefix = “أ” + Stem = “أمر”.
DEALING WITH BAMA’S SOLUTIONS
More linguistic information that BAMA does not provide has been added.
Broken plural and EDAFAH features:
- EX: The words “أبخرة” and “أبواب” are BR_PL.
- EX: The words “قلمي” and “كتب” in “كتب محمد كثيرة” are EDAFAH.
Named entity feature:
- EX: The words “الولايات المتحدة الأمريكية” have each been assigned the feature NE.
Root information:
- Detected according to the word’s lemma.
- A word may have two different lemmas and, consequently, two different roots.
  EX: The word “يتم” has the lemma “>atam~”, for which the root should be [tmm], and the lemma “yutom”, for which the root should be [ytm].
- An Arabic word may have one or two roots.
  EX: The word “محمد” has one root, [Hmd]. The word “سيد” has two roots, [swd/syd].
Stem pattern:
- Depends on the word’s lemma, stem and root.
- Root and stem pattern are quite independent.
- Words may have two stem patterns.
  EX: The word “مختار” has two stem patterns, “mufotaEil” and “mufotaEal”.
- Some words exhibit metathesis.
  EX: The word “آبار” has the root [b’r] and the pattern “>aEofaAl” (أعفال) rather than “>afoEaAl” (أفعال).
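Since the root is said to be detected according to the word's lemma, one way to picture it is a lookup keyed by lemma. The sketch below uses only the “يتم” example from the slide (Buckwalter transliteration); the dictionary layout and the function name are hypothetical, and nothing here is claimed to reflect how BAMAE stores its data.

```python
# Candidate lemmas per surface word and root per lemma, from the slide's example.
# The data layout is an illustrative assumption.
LEMMAS_BY_WORD = {"يتم": [">atam~", "yutom"]}
ROOT_BY_LEMMA = {">atam~": "tmm", "yutom": "ytm"}


def roots_for_word(word):
    """Map each candidate lemma of a word to its root; unknown lemmas are skipped."""
    return {
        lemma: ROOT_BY_LEMMA[lemma]
        for lemma in LEMMAS_BY_WORD.get(word, [])
        if lemma in ROOT_BY_LEMMA
    }
```

Keying the root on the lemma rather than on the surface word is what lets one written form carry two different roots.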
PREPARING DATA SET
Manually verified data: 400,000 words of training data and 60,000 words of testing data.
Training data preparation: Buckwalter (many solutions per word) -> manual disambiguation -> one solution for each word -> training data.
DISAMBIGUATION STAGES FOR BAMA SOLUTIONS
Word-based level.
Context-based level.
Memory-based level.
1. Word-based level.
Each prefix + stem + suffix concatenation is checked against Arabic grammatical rules and classified as possible or impossible.
EX: A word should not be an adjective in some cases:
- Adjective + possessive pronoun: impossible.
- Noun + possessive pronoun: possible.
Applying rules: 30% eliminated.
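The adjective example above can be sketched as a filter over candidate solutions. This is a minimal illustration under stated assumptions: the `Solution` class, its field names, and the tag strings `ADJ`, `NOUN` and `POSS_PRON` are hypothetical and are not taken from BAMA or BAMAE.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Solution:
    """One candidate analysis; field and tag names are illustrative only."""
    stem_tag: str    # e.g. "ADJ" or "NOUN"
    suffix_tag: str  # e.g. "POSS_PRON", or "" when there is no suffix


def is_possible(solution: Solution) -> bool:
    """Reject an adjective stem that carries a possessive-pronoun suffix."""
    return not (solution.stem_tag == "ADJ" and solution.suffix_tag == "POSS_PRON")


def word_level_filter(solutions):
    """Keep only the solutions whose concatenation passes the rule."""
    return [s for s in solutions if is_possible(s)]
```

A real rule set would hold many such checks; here a single predicate stands in for the whole set.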
2. Context-based level.
Disambiguation is based on the context of the word, using rules extracted from the training data.
Rule: after a preposition with no suffix, the following word is a noun in the genitive case, not an adjective or a verb.
EX: لا تبك على المفقود حتى لا تفقد الموجود
- حتى: Preposition
- لا: Negative Particle
- تفقد: Present Verb
Context-based level result: Present Verb, subjunctive mood.
Memory-based level: applicable only if all previous levels failed to decide the best solution.
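The preposition rule from the context-based level could be sketched as follows. Candidates are represented as hypothetical `(pos, case)` tuples, and the tag strings are assumptions. Falling back to the unfiltered list when no genitive noun is present is a design choice of this sketch, not something stated in the slides.

```python
def apply_preposition_rule(prev_tag, prev_has_suffix, candidates):
    """After a suffix-less preposition, keep only genitive-noun candidates.

    candidates is a list of (pos, case) tuples; tag names are illustrative.
    """
    if prev_tag != "PREP" or prev_has_suffix:
        return candidates  # the rule does not apply in this context
    kept = [c for c in candidates if c == ("NOUN", "GEN")]
    # Fall back to the original list rather than eliminate every candidate.
    return kept or candidates
```

For instance, a word after a bare preposition with candidates `("ADJ", "GEN")`, `("VERB", "")` and `("NOUN", "GEN")` is reduced to the genitive noun alone.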
3. Memory-based level.
The morphological features of an ambiguous word, along with its context, are stored together with their frequency of occurrence.
EX: تعددت استخدامات التليفون المحمول
- تعددت: Past Verb
- استخدامات: Noun
- التليفون: Noun
- المحمول: has 2 tags, Adjective and Noun
The two candidate sequences are compared by frequency:
Freq(Past Verb + Noun + Noun + Adjective)
Freq(Past Verb + Noun + Noun + Noun)
Word-based level -> Context-based level -> Memory-based level -> one solution for each word by BAMAE.
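The frequency comparison used at the memory-based level can be illustrated with a counter over tag sequences. This is a sketch only: the function names and the tag strings are hypothetical, and the slides do not say how BAMAE stores or looks up these frequencies.

```python
from collections import Counter


def build_memory(tagged_sentences):
    """Count how often each full tag sequence occurs in the training data."""
    return Counter(tuple(tags) for tags in tagged_sentences)


def pick_by_frequency(memory, context, candidates):
    """Return the candidate tag whose sequence (context + tag) is most frequent."""
    return max(candidates, key=lambda tag: memory[tuple(context) + (tag,)])
```

Sequences never seen in training count as zero, and if two candidates tie, the one listed first is returned.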
BAMAE: A final view
- Each word has 17 pieces of linguistic information, namely: word, lemma, vocalization, gloss, prefix1, prefix2, prefix3, stem, suffix1, suffix2, gender, number, definiteness, case, Arabic stem, stem pattern and root.
- Each word is indexed with its meta information.
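The 17 pieces of information listed above could be held in a record like the one below. The class name, the snake_case field names, and the choice of plain strings for every field are assumptions made for illustration; the slides do not describe BAMAE's actual output format.

```python
from dataclasses import dataclass, fields


@dataclass
class BamaeRecord:
    """One analyzed word; field order follows the list on the slide."""
    word: str
    lemma: str
    vocalization: str
    gloss: str
    prefix1: str
    prefix2: str
    prefix3: str
    stem: str
    suffix1: str
    suffix2: str
    gender: str
    number: str
    definiteness: str
    case: str
    arabic_stem: str
    stem_pattern: str
    root: str


# Field names in declaration order, e.g. for writing a column header.
FIELD_NAMES = [f.name for f in fields(BamaeRecord)]
```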
EVALUATING BAMAE
Testing data: 60,000 words, with one solution for each word.
Precision & Recall evaluation: 0.87 and 0.83.
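The slides do not state how precision and recall were computed. One common reading, assumed here, is that precision is the share of produced analyses that match the manually verified data and recall is the share of all test words that were analyzed correctly. The function and its dictionary-based inputs are hypothetical.

```python
def precision_recall(predicted, gold):
    """Compute precision and recall for one-solution-per-word output.

    predicted: dict of word index -> analysis (may omit unanalyzed words)
    gold:      dict of word index -> manually verified analysis
    """
    correct = sum(1 for i, analysis in predicted.items() if gold.get(i) == analysis)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall
```

Under this reading the two figures differ only when some test words receive no analysis at all.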
Conclusion
- Results are promising using a rule-based approach.
- Bibalex Enhancer is built on top of BAMA instead of building another analyzer from scratch.
- Future plan:
1. Increase the training data size.
2. Enhance the Arabic linguistic rules for disambiguation.
3. Adopt language modeling tools (SRILM).
- The system will be released soon on the Bibalex website: www.bibalex.org/UNL