Arabic morphology and POS
UW CLMA
11/19/2008
A short intro with a couple of demonstrations
ARABIC MORPHOLOGY AND POS TAGGING
OUTLINE
Arabic morphology: overview of the problem
Prior art, with a demonstration of Buckwalter’s AraMorph
Sketch of enhancements to AraMorph
Demonstration
Future directions
ARABIC MORPHOLOGY: OVERVIEW OF THE PROBLEM
Short vowels are not represented
The contrast between diphthongs and long vowels is not represented
Most closed class morphemes are written as affixes to the content word categories: nouns, adjectives, verbs and prepositions
ARABIC MORPHOLOGY: OVERVIEW (CONT.)
Some examples (glossing over a lot of detail):
شاهد الرجل الفيلم فرجع إلى البيت
$Ahd Alrjl Alfylm frjE {lA Albyt
$aahada r-rajul-u l-fiylm-a fa-rajaEa {ilaA l-bayt-i
saw-3sg.m the-man-nom the-film-acc and-so-returned-3sg.m to the-house-gen
The man watched the film and then went home.
This example is not so bad
REGULAR EXPRESSIONS FOR ORTHOGRAPHIC WORDS
(conj)? (enclitic_preposition)? noun_stem (plural)? (possessive_pronoun)?
(conj)? (definiteness marker)? noun_stem (plural)?
(conj)? full_word_preposition (genitive_pronoun)?
(conj)? complementizer (object_pron)?
(conj)? (modal)? (ImpVerbSubjAgr) verb_stem (plural_subject_marker)? (object_pronoun)?
(conj)? (modal)? verb_stem (perfVerbSubjAgr)? (object_pronoun)?
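As a rough illustration, the two noun templates can be written as ordinary Python regular expressions over Buckwalter-transliterated strings. The affix and stem inventories below are tiny hypothetical samples, and treating the plural and possessive slots as optional is an assumption; a real analyzer draws these from a lexicon.

```python
import re

# Toy inventories in Buckwalter transliteration (hypothetical, far from complete).
CONJ = r"(?P<conj>[wf])"            # w = and, f = and so
PREP = r"(?P<prep>[bkl])"           # enclitic prepositions
DET = r"(?P<det>Al)"                # definiteness marker
STEM = r"(?P<stem>ktAb|byt|rjl|fylm)"
PLURAL = r"(?P<plural>wn|yn|At)"
POSS = r"(?P<poss>hmA|hm|hA|nA|km|h|k|y)"

TEMPLATES = {
    "noun with possessor": re.compile(rf"^{CONJ}?{PREP}?{STEM}{PLURAL}?{POSS}?$"),
    "definite noun": re.compile(rf"^{CONJ}?{DET}?{STEM}{PLURAL}?$"),
}

def analyze(word):
    """Return (template name, filled slots) for every template the word matches."""
    results = []
    for name, pattern in TEMPLATES.items():
        match = pattern.match(word)
        if match:
            slots = {k: v for k, v in match.groupdict().items() if v}
            results.append((name, slots))
    return results

print(analyze("wAlbyt"))   # conj + definiteness marker + stem
print(analyze("bktAbh"))   # preposition + stem + possessive pronoun
print(analyze("byt"))      # a bare stem fits both templates
```

Even in this toy version a bare stem matches more than one template, which is the ambiguity the next slide is about.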
INHERENT AMBIGUITY
Some strings with multiple analyses
فقد == fqd : either
the verb fqd = he lost
OR
f qd = and so (verbal modal)
فقد سمعته = fqd smEth can be analyzed as
a) f qd smE t h (and so I had heard him)
b) fqd smEp h (he lost his reputation)
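A minimal sketch of how such competing analyses arise: split each orthographic word into every sequence of segments found in a lexicon. The lexicon, tag names and glosses below are hypothetical stand-ins, and unlike a real analyzer this sketch does no compatibility checking between prefixes, stems and suffixes.

```python
from itertools import product

# Surface segment -> list of (underlying morpheme, tag, gloss). Hypothetical entries.
LEXICON = {
    "f": [("f", "CONJ", "and so")],
    "qd": [("qd", "MODAL", "verbal modal")],
    "fqd": [("fqd", "VERB", "he lost")],
    "smE": [("smE", "VERB", "hear")],
    "smEt": [("smEp", "NOUN", "reputation")],  # final p is written t before a suffix
    "t": [("t", "SUBJ", "I")],
    "h": [("h", "PRON", "him/his")],
}

def segmentations(word):
    """Yield every split of word into consecutive segments listed in LEXICON."""
    if not word:
        yield []
        return
    for end in range(1, len(word) + 1):
        for entry in LEXICON.get(word[:end], []):
            for rest in segmentations(word[end:]):
                yield [entry] + rest

def sentence_analyses(words):
    """Yield every combination of per-word segmentations."""
    return product(*(list(segmentations(w)) for w in words))

for analysis in sentence_analyses(["fqd", "smEth"]):
    print(" | ".join(" ".join(m for m, _, _ in word) for word in analysis))
```

The two words yield four combinations, of which only (a) and (b) make sense together; picking among them is what the statistical models described later are for.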
OTHER ISSUES BEYOND THE SCOPE OF THIS TALK
Arabic spans 14 centuries and 22 countries
Is the liturgical language of over 1 billion Muslims
The Standard Language has never been a spoken variety.
The vernaculars have never been standardized.
The LDC corpus is the only annotated corpus that is readily available. The last time I looked, the treebank part was less than a million tokens.
PRIOR ART
Buckwalter’s AraMorph from LDC (a port from work done @ Xerox)
Ported to Java on top of Lucene (!) by Pierrick Brihaye circa 2003
http://cvs.savannah.gnu.org/viewvc/aramorph
Tagset and segmentation description
http://www.ldc.upenn.edu/Catalog/docs/LDC2003T06/POS-info.txt
Buckwalter’s transliteration scheme
http://www.qamus.org/transliteration.htm
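For readers who have not met the transliteration scheme, here is a small sketch of the one-to-one mapping between Arabic code points and Buckwalter's ASCII symbols. It covers the basic letters and diacritics only; the function names are my own, and any character outside the table is passed through unchanged.

```python
def _build_table():
    """Map Arabic code points to Buckwalter symbols, run by run."""
    runs = (
        (0x0621, "'|>&<}AbptvjHxd*rzs$SDTZEg"),  # hamza forms through ghain
        (0x0640, "_fqklmnhwYyFNKaui~o"),         # tatweel, feh..yeh, diacritics
        (0x0670, "`{"),                          # dagger alif, alif wasla
    )
    table = {}
    for start, symbols in runs:
        for offset, symbol in enumerate(symbols):
            table[chr(start + offset)] = symbol
    return table

AR2BW = _build_table()
BW2AR = {bw: ar for ar, bw in AR2BW.items()}

def to_buckwalter(text):
    """Transliterate Arabic script to Buckwalter ASCII."""
    return "".join(AR2BW.get(ch, ch) for ch in text)

def to_arabic(text):
    """Transliterate Buckwalter ASCII back to Arabic script."""
    return "".join(BW2AR.get(ch, ch) for ch in text)

print(to_buckwalter("\u0641\u0642\u062f"))  # the three letters feh, qaf, dal
```

Because the mapping is one-to-one, a round trip through `to_arabic` recovers the original string, which is what makes the scheme convenient for lexicon files and debugging output.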
AND NOW A DEMONSTRATION OF ARAMORPH
The point here is that most word strings have more than one legal analysis.
The other point is that the number of types is quite high, unless you do something to reveal the content word behind all the function morpheme affixes.
Kitaab (book)
Al-kitaab (the book)
These two queries in Arabic return different sets of results on Google
A FEW WORDS WRT ARAMORPH
AraMorph will generate all the legal analyses for which it has an entry in its lexicon
Pierrick Brihaye ported AraMorph to Java
AraMorph is the first stage in a lot of Arabic text processing done by researchers in the US.
ENHANCEMENTS TO ARAMORPH
I built this POS tagger in stages on top of Pierrick Brihaye’s port of AraMorph
The first thing I did was to port in a bigram model of segmented text from the LDC
This was used to choose the most likely segmentation sequence out of all of the analyses returned by Buckwalter’s analyzer
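The selection step might look roughly like the following sketch: train bigram counts on segmented text, then score each candidate segmentation sequence and keep the best. The add-k smoothing, the class and function names, and the four training sentences are all assumptions for illustration; the real model was trained on LDC data.

```python
import math
from collections import Counter

class BigramModel:
    """Add-k smoothed bigram model over morpheme segments (illustrative only)."""

    def __init__(self, segmented_sentences, k=0.1):
        self.k = k
        self.context_counts = Counter()
        self.bigram_counts = Counter()
        vocab = {"<s>", "</s>"}
        for sentence in segmented_sentences:
            tokens = ["<s>"] + list(sentence) + ["</s>"]
            vocab.update(tokens)
            self.context_counts.update(tokens[:-1])
            self.bigram_counts.update(zip(tokens, tokens[1:]))
        self.vocab_size = len(vocab)

    def log_prob(self, segments):
        """Log probability of a segment sequence, including sentence boundaries."""
        tokens = ["<s>"] + list(segments) + ["</s>"]
        total = 0.0
        for prev, cur in zip(tokens, tokens[1:]):
            numerator = self.bigram_counts[(prev, cur)] + self.k
            denominator = self.context_counts[prev] + self.k * self.vocab_size
            total += math.log(numerator / denominator)
        return total

def most_likely(model, candidates):
    """Pick the candidate segmentation sequence with the highest score."""
    return max(candidates, key=model.log_prob)

# Hypothetical segmented training text (Buckwalter transliteration).
TRAIN = [
    ["f", "qd", "smE", "t", "h"],
    ["qd", "smE", "t", "Al", "xbr"],
    ["f", "qd", "rjE", "t"],
    ["fqd", "Al", "rjl", "smEp", "h"],
]
# The four candidate sequences for the string "fqd smEth".
CANDIDATES = [
    ["f", "qd", "smE", "t", "h"],
    ["f", "qd", "smEp", "h"],
    ["fqd", "smE", "t", "h"],
    ["fqd", "smEp", "h"],
]

model = BigramModel(TRAIN)
best = most_likely(model, CANDIDATES)
print(" ".join(best))
```

One caveat with raw sequence probabilities is that they favor candidates with fewer segments; a small smoothing constant keeps attested bigrams well ahead of unseen ones in this toy setting.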
ARCHITECTURE (AS IT EVOLVED)
With a 5-word sliding window, generate all sequences of segmentations for that 5-word window based on all the analyses returned by AraMorph.
This scheme produced acceptable results
Some time later a trigram model of the tags was added and given 50% weight with the segmentation scores to decide which tags to keep with the segments
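A rough sketch of this architecture, under several assumptions: each window commits only its first word (the last window commits everything), the two scores are combined as a plain weighted sum, and the toy scoring functions simply count attested n-grams where the real system would use model probabilities. All names and data here are hypothetical.

```python
from itertools import product

def best_in_window(window, seg_score, tag_score, tag_weight=0.5):
    """window: list of words; each word is a list of candidate analyses;
    each analysis is a list of (segment, tag) pairs."""
    best, best_score = None, float("-inf")
    for combo in product(*window):
        pairs = [pair for analysis in combo for pair in analysis]
        segments = [seg for seg, _ in pairs]
        tags = [tag for _, tag in pairs]
        score = (1 - tag_weight) * seg_score(segments) + tag_weight * tag_score(tags)
        if best is None or score > best_score:
            best, best_score = combo, score
    return list(best)

def tag_sentence(analyses, seg_score, tag_score, size=5):
    """Slide a window over the sentence, keeping one analysis per word."""
    chosen = []
    for start in range(max(1, len(analyses) - size + 1)):
        best = best_in_window(analyses[start:start + size], seg_score, tag_score)
        if start + size >= len(analyses):
            chosen.extend(best)      # last window: keep every word in it
        else:
            chosen.append(best[0])   # otherwise commit only the first word
    return chosen

# Toy scorers: count attested segment bigrams and tag trigrams.
SEEN_BIGRAMS = {("f", "qd"), ("qd", "smE"), ("smE", "t"), ("t", "h")}
SEEN_TRIGRAMS = {("CONJ", "MODAL", "VERB"), ("MODAL", "VERB", "SUBJ"),
                 ("VERB", "SUBJ", "PRON")}

def seg_score(segments):
    return sum(b in SEEN_BIGRAMS for b in zip(segments, segments[1:]))

def tag_score(tags):
    return sum(t in SEEN_TRIGRAMS for t in zip(tags, tags[1:], tags[2:]))

ANALYSES = [  # candidate analyses for "fqd smEth"
    [[("f", "CONJ"), ("qd", "MODAL")], [("fqd", "VERB")]],
    [[("smE", "VERB"), ("t", "SUBJ"), ("h", "PRON")],
     [("smEp", "NOUN"), ("h", "PRON")]],
]
result = tag_sentence(ANALYSES, seg_score, tag_score)
print(result)
```

The window bounds the combinatorial blow-up: with a handful of analyses per word, five words stay enumerable where a whole sentence would not.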
THIS BEARS SOME SIMILARITY TO OTHER WORK DONE IN 2005
Habash, Nizar and Owen Rambow. Arabic Tokenization, Morphological Analysis, and Part-of-Speech Tagging in One Fell Swoop. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL’05).
His team used Ripper (Cohen, 1996) to learn a rule-based classifier (Rip).
They also used AraMorph as their starting point to produce all legal morphological sequences.
http://www.mt-archive.info/ACL-2005-Habash1.pdf
HOW WELL DOES THE POS TAGGER PERFORM?
Good question, still TBD
I meant to pull out some of the training data and test it against a piece of the LDC corpus.
I ran out of time
Hand analysis puts it at better than 90%.
At some point I turned on the option to not toss the vowels provided by AraMorph.
This is observably less accurate
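The held-out test that is still to be done could be as simple as the sketch below: compare the tagger's chosen analysis for each word against a gold annotation and count a word as correct only when both segmentation and tags match. The function name, tag names and data are hypothetical; real gold data would come from the LDC annotations.

```python
def word_accuracy(gold, predicted):
    """Fraction of words whose full analysis (segments and tags) matches gold.

    gold, predicted: lists of sentences; a sentence is a list of words;
    a word is a list of (segment, tag) pairs.
    """
    total = correct = 0
    for gold_sentence, predicted_sentence in zip(gold, predicted):
        for gold_word, predicted_word in zip(gold_sentence, predicted_sentence):
            total += 1
            correct += gold_word == predicted_word
    return correct / total if total else 0.0

GOLD = [
    [[("f", "CONJ"), ("qd", "MODAL")],
     [("smE", "VERB"), ("t", "SUBJ"), ("h", "PRON")]],
    [[("fqd", "VERB")],
     [("smEp", "NOUN"), ("h", "PRON")]],
]
PREDICTED = [
    [[("f", "CONJ"), ("qd", "MODAL")],
     [("smE", "VERB"), ("t", "SUBJ"), ("h", "PRON")]],
    [[("f", "CONJ"), ("qd", "MODAL")],        # wrong reading of fqd
     [("smEp", "NOUN"), ("h", "PRON")]],
]
accuracy = word_accuracy(GOLD, PREDICTED)
print(accuracy)
```

Scoring whole words rather than individual segments sidesteps the alignment problem that arises when the predicted and gold segmentations differ in length.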
FIRST: A WORD FROM MY SPONSOR
I’m allowed to talk about this system
I was told that I could expose its functionality on a website
I am not allowed to distribute it or use it for commercial purposes
There is an earlier tagger that does not incorporate Lucene or AraMorph. It is based on Brill’s transformation-based learning @
http://innerbrat.org/segmentTagDownload
THE DEMOS
Tag to Buckwalter transliteration output
Tag to ENAMEX-style tags
Tag to UTF-8 Arabic
Re-attaching the segments
Reduced tagset
Reloading the dictionary every time is annoying
Tag with a server and thin client
UW CLMA
11/19/2008
FUTURE DIRECTIONS
Any further work will require me to rebuild everything from scratch
Uncouple it from Lucene
Port it to C++ or C#
Bring in a statistical language model or two for recovering the short vowels.
Use some state-of-the-art machine learning toolkits to improve performance
Start annotating some of my corpora
FUTURE DIRECTIONS (CONT.)
See if I can embed it in some practical applications such as
language teaching document production
preprocessing for machine translation systems
preprocessing ASR
Text to speech
Bootstrap annotation tools for other Afro-Asiatic languages
Tigrinya, Somali, Hausa, Hebrew, Arabic vernaculars, Amharic, Amazigh, Coptic, Egyptian Hieroglyphs, Babylonian, Punic
Help with ODIN??
THE END