Arabic morphology and POS


UW CLMA
11/19/2008
ARABIC MORPHOLOGY AND POS TAGGING
A short intro with a couple of demonstrations
OUTLINE
- Arabic morphology: overview of the problem
- Prior art, with a demonstration of Buckwalter’s AraMorph
- Sketch of enhancements to AraMorph
- Demonstration
- Future directions

ARABIC MORPHOLOGY: OVERVIEW OF THE PROBLEM
- Short vowels are not represented
- The contrast between diphthongs and long vowels is not represented
- Most closed-class morphemes are written as affixes to the content-word categories: nouns, adjectives, verbs and prepositions

ARABIC MORPHOLOGY: OVERVIEW (CONT.)

Some examples (glossing over a lot of detail):
- شاهد الرجل الفيلم فرجع إلى البيت
- $Ahd Alrjl Alfylm frjE {lA Albyt
- $aahada r-rajul-u l-fiylm-a fa-rajaEa {ilaA l-bayt-i
- saw-3sg.m the-man-nom the-film-acc and-so-returned-3sg.m to the-house-gen
- The man watched the film and then went home.

This example is not so bad.
REGULAR EXPRESSIONS FOR ORTHOGRAPHIC WORDS
- (conj)? (enclitic_preposition)? noun_stem (plural)? (possessive_pronoun)?
- (conj)? (definiteness_marker)? noun_stem (plural)?
- (conj)? full_word_preposition (genitive_pronoun)?
- (conj)? complementizer (object_pronoun)?
- (conj)? (modal)? (ImpVerbSubjAgr) verb_stem (plural_subject_marker)? (object_pronoun)?
- (conj)? (modal)? verb_stem (perfVerbSubjAgr)? (object_pronoun)?
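
One of the noun templates above, (conj)? (definiteness marker)? noun_stem (plural)?, can be written out as a real regular expression over Buckwalter-transliterated text. This is only a minimal sketch and not how AraMorph works internally: the affix inventories and the four-noun stem list are toy assumptions, and the names `NOUN_TEMPLATE` and `segment_noun` are hypothetical.

```python
import re

# Toy inventories in Buckwalter transliteration (assumptions for illustration,
# not AraMorph's lexicon).
CONJ = ["w", "f"]                 # wa-, fa-
DET = ["Al"]                      # definite article
NOUN_STEMS = ["byt", "rjl", "fylm", "ktAb"]
PLURAL = ["At", "wn", "yn"]


def alt(items):
    """Build a regex alternation from a list of literal strings."""
    return "|".join(re.escape(item) for item in items)


# (conj)? (definiteness marker)? noun_stem (plural)?
NOUN_TEMPLATE = re.compile(
    rf"^(?P<conj>{alt(CONJ)})?"
    rf"(?P<det>{alt(DET)})?"
    rf"(?P<stem>{alt(NOUN_STEMS)})"
    rf"(?P<plural>{alt(PLURAL)})?$"
)


def segment_noun(word):
    """Return the matched morphemes as a dict, or None if the template fails."""
    match = NOUN_TEMPLATE.match(word)
    if match is None:
        return None
    return {name: value for name, value in match.groupdict().items()
            if value is not None}


print(segment_noun("wAlbyt"))  # conj + det + stem
print(segment_noun("fylm"))    # the initial f belongs to the stem here
print(segment_noun("frjE"))    # not a noun in the toy lexicon
```

A single regex match returns only one analysis per word, so a sketch like this cannot by itself expose the ambiguity discussed in this talk; an analyzer has to enumerate every split that the lexicon licenses.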
INHERENT AMBIGUITY

Some strings with multiple analyses:
- فقد == fqd: either
  - the verb fqd = he lost, OR
  - f qd = and so (verbal modal)
- فقد سمعته = fqd smEth can be analyzed as
  - a) f qd smE t h (and so I had heard him)
  - b) fqd smEp h (he lost his reputation)
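
The two readings of fqd can be reproduced by enumerating every prefix/stem split that a lexicon allows. The sketch below assumes a toy lexicon with two stems and two conjunction prefixes; the tag labels (`CONJ`, `VERB`, `MODAL`) are illustrative and are not the LDC tagset, and `analyses` is a hypothetical helper name.

```python
# Toy lexicon in Buckwalter transliteration (assumed for illustration only).
PREFIXES = {"": None, "f": "CONJ", "w": "CONJ"}
STEMS = {
    "fqd": ["VERB"],   # fqd = "he lost"
    "qd": ["MODAL"],   # qd = verbal modal
}


def analyses(word):
    """Return every (segment, tag) sequence the toy lexicon licenses for word."""
    results = []
    for prefix, prefix_tag in PREFIXES.items():
        if not word.startswith(prefix):
            continue
        rest = word[len(prefix):]
        for stem_tag in STEMS.get(rest, []):
            segments = [(prefix, prefix_tag)] if prefix else []
            segments.append((rest, stem_tag))
            results.append(segments)
    return results


for analysis in analyses("fqd"):
    print(analysis)
```

Both analyses are legal in isolation; only the following word, as in the fqd smEth example, can help decide between them.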
OTHER ISSUES BEYOND THE SCOPE OF THIS TALK
- Arabic spans 14 centuries and 22 countries.
- It is the liturgical language of over 1 billion Muslims.
- The standard language has never been a spoken variety.
- The vernaculars have never been standardized.
- The LDC corpus is the only annotated corpus that is readily available. The last time I looked, the treebank part was less than a million tokens.
PRIOR ART
- Buckwalter’s AraMorph from the LDC (a port from work done at Xerox)
- Ported to Java on top of Lucene (!) by Pierrick Brihaye circa 2003: http://cvs.savannah.gnu.org/viewvc/aramorph
- Tagset and segmentation description: http://www.ldc.upenn.edu/Catalog/docs/LDC2003T06/POS-info.txt
- Buckwalter’s transliteration scheme: http://www.qamus.org/transliteration.htm
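
Since the examples in this talk are written in Buckwalter's transliteration, a small converter may help when reading them. The sketch below covers only the hamza forms and the consonant letters (U+0621 to U+063A and U+0641 to U+064A); short-vowel diacritics and the rarer symbols of the scheme are left out, and the function name `to_buckwalter` is an assumption.

```python
# Buckwalter symbols for two contiguous Unicode ranges of Arabic letters.
_HAMZA_TO_GHAIN = "'|>&<}AbptvjHxd*rzs$SDTZEg"  # U+0621 .. U+063A
_FEH_TO_YEH = "fqklmnhwYy"                       # U+0641 .. U+064A

BUCKWALTER = {0x0621 + i: ch for i, ch in enumerate(_HAMZA_TO_GHAIN)}
BUCKWALTER.update({0x0641 + i: ch for i, ch in enumerate(_FEH_TO_YEH)})


def to_buckwalter(text):
    """Transliterate Arabic letters; any other character passes through."""
    return text.translate(BUCKWALTER)


# Three words from the examples in this talk, given as code points
# to avoid bidirectional display problems in source code.
print(to_buckwalter("\u0627\u0644\u0628\u064a\u062a"))  # Albyt
print(to_buckwalter("\u0641\u0642\u062f"))              # fqd
print(to_buckwalter("\u0634\u0627\u0647\u062f"))        # $Ahd
```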

AND NOW A DEMONSTRATION OF ARAMORPH
- The point here is that most word strings have more than one legal analysis.
- The other point is that the number of types is quite high, unless you do something to reveal the content word behind all the function-morpheme affixes.
  - Kitaab (book)
  - Al-kitaab (the book)
  - These two queries in Arabic return different sets of results on Google.
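
The effect of clitics on the type count can be illustrated with a deliberately naive prefix stripper. This is a toy sketch under stated assumptions: the token list is made up, `strip_clitics` is a hypothetical helper, and blind stripping of this kind is exactly what a lexicon-backed analyzer such as AraMorph is meant to replace.

```python
import re

# Naive rule: optionally strip a conjunction (w, f) and then the article (Al).
CLITIC_PREFIX = re.compile(r"^(?:w|f)?(?:Al)?")


def strip_clitics(token):
    """Remove leading clitics by pattern alone, with no lexicon check."""
    stripped = CLITIC_PREFIX.sub("", token, count=1)
    return stripped or token


# Made-up tokens in Buckwalter transliteration.
tokens = ["ktAb", "AlktAb", "wAlktAb", "byt", "Albyt"]

types_before = set(tokens)
types_after = {strip_clitics(token) for token in tokens}

print(len(types_before), "types before stripping")
print(len(types_after), "types after stripping:", sorted(types_after))

# The same rule damages words whose stem happens to begin with f or w.
print(strip_clitics("fylm"))
```

The last line shows the weakness: fylm (film) loses its first stem consonant, because the pattern cannot tell a conjunction from the start of a stem.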

A FEW WORDS WRT ARAMORPH
- AraMorph will generate all the legal analyses for which it has an entry in its lexicon.
- Pierrick Brihaye ported AraMorph to Java.
- AraMorph is the first stage in a lot of Arabic text processing done by researchers in the US.

ENHANCEMENTS TO ARAMORPH
- I built this POS tagger in stages on top of Pierrick Brihaye’s port of AraMorph.
- The first thing I did was to port in a bigram model of segmented text from the LDC.
- This was used to choose the most likely segmentation sequence out of all of the analyses returned by Buckwalter’s analyzer.
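
The selection step can be sketched as follows: take the candidate segmentations for each word, form every combination, and keep the one a bigram model over segments scores highest. The probabilities below are invented toy numbers, the unseen-bigram floor is an arbitrary assumption, and `best_segmentation` is a hypothetical name; the real system used a model estimated from LDC data.

```python
import itertools
import math

# Invented bigram probabilities over segments (toy numbers, not LDC estimates).
BIGRAM_P = {
    ("<s>", "f"): 0.10, ("f", "qd"): 0.30, ("qd", "smE"): 0.20,
    ("smE", "t"): 0.20, ("t", "h"): 0.20,
    ("<s>", "fqd"): 0.01, ("fqd", "smEp"): 0.02, ("smEp", "h"): 0.30,
}
UNSEEN = 1e-6  # assumed floor probability for bigrams missing from the table


def sequence_logp(segments):
    """Sum of bigram log-probabilities over a flat list of segments."""
    total = 0.0
    previous = "<s>"
    for segment in segments:
        total += math.log(BIGRAM_P.get((previous, segment), UNSEEN))
        previous = segment
    return total


def best_segmentation(candidates_per_word):
    """Pick one segmentation per word so the whole sequence scores highest."""
    def score(choice):
        flat = [segment for word in choice for segment in word]
        return sequence_logp(flat)
    return max(itertools.product(*candidates_per_word), key=score)


# Candidate analyses for "fqd smEth", as an analyzer might return them.
candidates = [
    [("fqd",), ("f", "qd")],
    [("smE", "t", "h"), ("smEp", "h")],
]
print(best_segmentation(candidates))
```

With these toy numbers the "and so I had heard him" reading wins. One known weakness of raw bigram products is that they favor shorter segment sequences, so a real scorer may need some form of length normalization.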

ARCHITECTURE (AS IT EVOLVED)

- With a 5-word sliding window, generate all sequences of segmentations for that 5-word window, based on all the analyses returned by AraMorph.
- This scheme produced acceptable results.
- Some time later, a trigram model of the tags was added and given 50% weight with the segmentation scores to decide which tags to keep with the segments.
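
The windowing and the 50% weighting can be sketched as below. The slide does not say how the window advances or exactly how the two scores are combined, so the one-word step and the linear interpolation of log scores are assumptions, as are the names `windows`, `combined_score` and `pick`.

```python
def windows(words, size=5):
    """Yield consecutive windows of `size` words, advancing one word at a time."""
    if len(words) <= size:
        yield list(words)
        return
    for start in range(len(words) - size + 1):
        yield list(words[start:start + size])


def combined_score(segment_logp, tag_logp, tag_weight=0.5):
    """Interpolate the segmentation score with the tag-trigram score."""
    return (1.0 - tag_weight) * segment_logp + tag_weight * tag_logp


def pick(candidates):
    """candidates: (label, segment_logp, tag_logp) triples; return the best label."""
    best = max(candidates, key=lambda c: combined_score(c[1], c[2]))
    return best[0]


sentence = ["w1", "w2", "w3", "w4", "w5", "w6", "w7"]  # placeholder tokens
for window in windows(sentence):
    print(window)

# Invented log scores: A segments better, B has the more likely tag sequence.
print(pick([("A", -2.0, -6.0), ("B", -3.0, -4.0)]))
```

Within each window the number of candidate sequences is the product of the per-word analysis counts, which is presumably why the window is kept as short as five words.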
THIS BEARS SOME SIMILARITY TO OTHER WORK DONE IN 2005
- Habash, Nizar and Owen Rambow. Arabic Tokenization, Morphological Analysis, and Part-of-Speech Tagging in One Fell Swoop. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL’05).
- His team used Ripper (Cohen, 1996) to learn a rule-based classifier (Rip).
- They also used AraMorph as their starting point to produce all legal morphological sequences.
- http://www.mt-archive.info/ACL-2005-Habash1.pdf

HOW WELL DOES THE POS TAGGER PERFORM?
- Good question, still TBD.
  - I meant to pull out some of the training data and test it against a piece of the LDC corpus.
  - I ran out of time.
  - Hand analysis puts it at better than 90%.
- At some point I turned on the option to not toss the vowels provided by AraMorph.
  - This is observably less accurate.
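
The held-out evaluation that was planned but not run could look roughly like the sketch below. It assumes the tagger's output is already aligned token for token with the gold annotation, which does not hold when the predicted segmentation differs from the gold one; `split_held_out` and `tag_accuracy` are hypothetical names, and the tags in the usage lines are made up.

```python
def split_held_out(sentences, every=10):
    """Hold out every `every`-th sentence for testing; train on the rest."""
    train = [s for i, s in enumerate(sentences) if i % every != 0]
    test = [s for i, s in enumerate(sentences) if i % every == 0]
    return train, test


def tag_accuracy(gold_tags, predicted_tags):
    """Fraction of positions where the predicted tag equals the gold tag."""
    if len(gold_tags) != len(predicted_tags):
        raise ValueError("gold and predicted sequences must be aligned")
    if not gold_tags:
        return 0.0
    correct = sum(g == p for g, p in zip(gold_tags, predicted_tags))
    return correct / len(gold_tags)


# Made-up tags, one error in four tokens.
gold = ["NOUN", "VERB", "PREP", "NOUN"]
predicted = ["NOUN", "VERB", "NOUN", "NOUN"]
print(tag_accuracy(gold, predicted))

train, test = split_held_out(list(range(20)))
print(len(train), len(test))
```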
FIRST: A WORD FROM MY SPONSOR
- I’m allowed to talk about this system.
- I was told that I could expose its functionality on a website.
- I am not allowed to distribute it or use it for commercial purposes.
- There is an earlier tagger that does not incorporate Lucene or AraMorph. It is based on Brill’s transformation-based (TB) learning and is at http://innerbrat.org/segmentTagDownload

THE DEMOS
- Tag to Buckwalter transliteration output
- Tag to enamex-style tags
- Tag to
  - UTF-8 Arabic
  - Re-attaching the segments
  - Reduced tagset
- Reloading the dictionary every time is annoying
  - Tag with a server and thin client
FUTURE DIRECTIONS

- Any further work will require me to rebuild everything from scratch
  - Uncouple it from Lucene
  - Port it to C++ or C#
  - Bring in a statistical language model or two for recovering the short vowels
  - Use some state-of-the-art machine learning toolkits to improve performance
  - Start annotating some of my corpora
FUTURE DIRECTIONS

- See if I can embed it in some practical applications such as
  - language teaching document production
  - preprocessing for machine translation systems
  - preprocessing ASR
  - text to speech
- Bootstrap annotation tools for other Afro-Asiatic languages
  - Tigrinya, Somali, Hausa, Hebrew, Arabic vernaculars, Amharic, Amazigh, Coptic, Egyptian hieroglyphs, Babylonian, Punic
- Help with ODIN??
THE END