Rule-based approach in Arabic NLP: Tools, Systems and Resources
Dr Khaled Shaalan
Professor, Faculty of Computers & Information, Cairo University
On Secondment to BUiD, UAE
Khaled.shaalan@{buid.ac.ae, gmail.com}
CITALA2009 - Morocco
Agenda
Objective
Language Tasks
NLP Approaches
Rule-based Arabic Analysis and generation tools
Rule-based Arabic NLP applications
Some Arabic NLP Free Resources
Major and Arabic mailing lists
Conclusion
Objective
To show how the rule-based approach has been successfully used to develop Arabic natural language processing tools and applications.
Separating Language Tasks
English vs. French vs. Arabic vs. ...
Spoken language (dialogue) vs. written text vs. handwritten script
Genuine script vs. transliterated (Romanized) script
Vocalized (vowelized) vs. non-vocalized
Understanding vs. generation
First language learner vs. second language learner
Classical or Qur'anic Arabic vs. Modern Standard Arabic vs. colloquial (dialects)
Stem-based vs. root-based
Rules
Situation/Action
If match(stem.prefix, def_article)
then remove(stem.prefix, Stem_FS)
If match(stem.definiteness, indefinite)
then morph_gen(stem.definiteness, Stem_FS)
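A minimal Python sketch of the situation/action idea, assuming the feature structure (Stem_FS) is a plain dictionary. The names Rule, strip_article and apply_rules are hypothetical and only illustrate how a test on the feature structure can be paired with an action that updates it.

```python
from typing import Callable, NamedTuple

DEF_ARTICLE = "ال"  # the Arabic definite article


class Rule(NamedTuple):
    name: str
    situation: Callable[[dict], bool]  # test on the feature structure
    action: Callable[[dict], dict]     # returns an updated feature structure


def strip_article(fs: dict) -> dict:
    # Action: move the article out of the stem and record definiteness.
    return {
        **fs,
        "stem": fs["stem"][len(DEF_ARTICLE):],
        "prefix": DEF_ARTICLE,
        "definiteness": "definite",
    }


RULES = [
    Rule(
        "remove definite article",
        lambda fs: fs["stem"].startswith(DEF_ARTICLE),
        strip_article,
    ),
]


def apply_rules(fs: dict, rules=RULES) -> dict:
    # Each rule fires at most once, in list order, when its situation holds.
    for rule in rules:
        if rule.situation(fs):
            fs = rule.action(fs)
    return fs


result = apply_rules({"stem": "الكتاب", "definiteness": "unknown"})
```

Here the rules are applied once, in a fixed order; there is no matching loop or conflict resolution.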
Common Mistake
A rule-based approach is not a rule-based expert system!
Both consist of rules.
Rule-based expert systems solve the problem by a Recognize-Act Cycle:
• Loop
• Conflict resolution strategy
Recognize-Act Cycle
loop
1. Match: Rules are compared to working memory to determine matches. If no rule matches, then stop.
2. Conflict Resolution: Select or enable a single rule for execution.
3. Execute: Fire the selected rule:
• Add a new fact, or
• Learn a new rule
end loop
[Figure: Recognize-Act Cycle diagram with the labels Domain Knowledge (Rule Base), Working Memory (Fact Base), Match (n rules), Conflict Resolution (1 rule), Execute (New Fact / New Rule).]
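The Recognize-Act Cycle can be sketched in a few lines of Python. This is an illustration under assumptions: facts are plain strings, each rule is a dictionary with hypothetical keys ("if", "then", "priority"), conflict resolution picks the highest priority, and the "learn a new rule" branch is left out.

```python
def recognize_act(rules, facts, max_cycles=100):
    """Run match / conflict resolution / execute until no rule matches."""
    facts = set(facts)
    fired = []
    for _ in range(max_cycles):
        # 1. Match: rules whose conditions hold and that would add something new.
        conflict_set = [
            r for r in rules if r["if"] <= facts and r["then"] not in facts
        ]
        if not conflict_set:
            break  # no rule matches: stop
        # 2. Conflict resolution: select a single rule (highest priority here).
        rule = max(conflict_set, key=lambda r: r["priority"])
        # 3. Execute: fire the selected rule by adding its new fact.
        facts.add(rule["then"])
        fired.append(rule["name"])
    return facts, fired


# Hypothetical toy rule base and working memory.
rules = [
    {"name": "r1", "if": {"has_al_prefix"}, "then": "is_definite", "priority": 1},
    {"name": "r2", "if": {"is_definite"}, "then": "is_noun", "priority": 2},
]
final_facts, fired = recognize_act(rules, {"has_al_prefix"})
```

The point of the contrast on the previous slide is that a rule-based NLP tool does not need this loop: its rules are usually applied in an order fixed by the developer.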
NLP Approaches
Rule-based
Statistical-based
NLP Approaches (1)
Rule-based:
• Relies on hand-constructed rules that are to be acquired from language specialists
• Requires only a small amount of training data
• Development could be very time consuming
Statistical-based:
• Developers do not need language specialists' expertise
• Requires a large amount of annotated training data (very large corpora)
• Automated
NLP Approaches (2)
Rule-based:
• Some changes may be hard to accommodate
• Not easy to obtain high coverage of the linguistic knowledge
• Useful for limited domains
• Can be used with both well-formed and ill-formed input
• High quality, based on solid linguistic knowledge
Statistical-based:
• Some changes may require re-annotation of the entire training corpus
• Coverage depends on the training data
• Not easy to work with ill-formed input, as both well-formed and ill-formed input are still probable
• Lower quality: does not explicitly deal with syntax
Rule-based Arabic NLP tools




Morphological Analyzers
Morphological Generators
Syntactic Analyzers
Syntactic Generators
Rule-based Arabic
Morphological Analyzer
Morphological Analysis
Break down the inflected Arabic word into a root/stem, affixes, and features.
Example: sa-'uEty-kumA (سأعطيكما) - 'will I give you…'
سـ: sa-
TYPE: Particle
INFLECTION: 'Future'
أعطيـ: -'uEty-
TYPE: VERB
ASPECT: IMPERF
MOOD: IND
PERS: 1
GENDER: M/F
NUMBER: SG
SUBJ: I
كما: -kumA
TYPE: AFFPR
GENDER: M/F
NUMBER: DUAL
GF: OBJ
Rules - Augmented Transition
Network (ATN) technique



Rules associated with arcs represent the
context-sensitive knowledge about the
relation between a root and inflections.
More than one rule may be associated with
one arc.
Conditions associated with the arcs are
placed in such a way that the arc to be
traversed first is the one that leads to the
most probable solution.
Arabic Morphology using ATN Technique
Types of Rules




Remove Prefix or Suffix
Remove doubled letter
Add/change Hamza, Weak letter,…
…
Analysis of the verb "شاهدتك" (I saw you): Remove suffixes
[Figure: ATN path through states S0, S1, S2, S3, S10]
شاهدتك: arc test last1 = "ك" leaves شاهدت
شاهدت: arc test last2 = "ت" leaves شاهد
Result:
• stem: "شاهد" (saw)
• perfect
• 1st person sg pronoun: "ت"
• 2nd person sg pronoun: "ك"
Analysis of the verb "يلعبون" (they are playing): Remove prefix & suffix
[Figure: ATN path through states S0, S1, S2, S3, S10]
يلعبون: arc test Begin2 = "ي" leaves لعبون
لعبون: arc test last2 = "ون" leaves لعب
Result:
• stem: "لعب" (played)
• imperfect
• Plural subject
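The two analyses above can be imitated with a small Python sketch. It is not an ATN: it simply enumerates every combination of an optional prefix, subject suffix and object suffix, and keeps the combinations whose remaining stem is in a lexicon. The affix tables and the two-word lexicon are assumptions made for this illustration.

```python
from itertools import product

# Hypothetical affix tables; the empty string means "no affix".
PREFIXES = {"": {}, "ي": {"aspect": "imperfect"}}
SUBJECT_SUFFIXES = {
    "": {},
    "ت": {"subject": "1st person sg"},
    "ون": {"subject": "plural"},
}
OBJECT_SUFFIXES = {"": {}, "ك": {"object": "2nd person sg"}}
LEXICON = {"شاهد": "saw", "لعب": "played"}


def analyze(word):
    """Return every affix segmentation whose stem is in the lexicon."""
    analyses = []
    for (pre, pre_f), (subj, subj_f), (obj, obj_f) in product(
        PREFIXES.items(), SUBJECT_SUFFIXES.items(), OBJECT_SUFFIXES.items()
    ):
        if not (word.startswith(pre) and word.endswith(subj + obj)):
            continue
        stem = word[len(pre): len(word) - len(subj) - len(obj)]
        if stem in LEXICON:
            # Perfect is the default; an imperfect prefix overrides it.
            features = {"stem": stem, "gloss": LEXICON[stem], "aspect": "perfect"}
            for extra in (pre_f, subj_f, obj_f):
                features.update(extra)
            analyses.append(features)
    return analyses


saw_you = analyze("شاهدتك")
playing = analyze("يلعبون")
```

Requiring the stem to be in the lexicon is what keeps the enumeration from overgenerating in this toy setting.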
Issues in the morphological analysis
Overgeneration (too many outputs)
Ambiguity
Reconstruction of vowels
MultiWord/compound Expressions
Out-of-Vocabulary (OOV)
Handling ill-formed input
• Detection (spell checking)
• Correction by relaxation: “ه” instead of “ة”
Prevent ill-formed output
• Check the compatibility (the prefix “ف” cannot come after the prefix “ب” (or “ك”)).
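Two of these checks are small enough to sketch in Python. The rank table, the function names and the one-word lexicon below are assumptions for illustration; only the constraint on “ف”, “ب” and “ك” and the “ه”/“ة” relaxation come from the slide.

```python
# Hypothetical ordering classes: a lower rank must come first in the word.
PREFIX_RANK = {"ف": 0, "ب": 1, "ك": 1}


def prefixes_compatible(prefixes):
    """True if the prefixes appear in strictly increasing rank order."""
    ranks = [PREFIX_RANK[p] for p in prefixes]
    return all(a < b for a, b in zip(ranks, ranks[1:]))


def relax_final_ha(word, lexicon):
    """Look the word up; if that fails, retry with a final ه read as ة."""
    if word in lexicon:
        return word
    if word.endswith("ه"):
        candidate = word[:-1] + "ة"
        if candidate in lexicon:
            return candidate
    return None


lexicon = {"مدرسة"}  # toy lexicon containing "school"
corrected = relax_final_ha("مدرسه", lexicon)
```

With this table, the sequence ف then ب is accepted while ب then ف is rejected.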
Rule-based Arabic
Morphological Generator
Morphological generation

Synthesis of an inflected Arabic word from a
given root/stem according to a combination
of morphological properties that include:






definiteness (definite article “‫)”ال‬,
gender (masculine, feminine),
number (singular, dual, plural),
case (nominative, genitive, accusative,…),
person (first, second, third)
…
Types of Rules

synthesis of inflected



Noun
Verb
particle
Synthesis of inflected Nouns








definite noun
feminine noun
pluralize noun
dual noun
attach a prefix preposition
attach a suffix pronoun
end case
….
Synthesis of feminine noun
If noun.gender = masculine
Then attach suffix feminine letter
Example:
“زوج” (husband) → “زوجة” (wife)
Synthesis of suffix pronoun
If pronoun.person = first and pronoun.number = singular
Then attach first person singular suffix pronoun
Example:
“زوجة” (wife) → “زوجتي” (my wife)
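The two noun rules above can be sketched in Python as follows. The dictionary representation and function names are assumptions. The change of the final “ة” to “ت” before the suffix is not stated in the rule text; it is read off the example زوجة → زوجتي.

```python
TA_MARBUTA = "ة"  # the feminine letter


def make_feminine(noun):
    """If the noun is masculine, attach the feminine letter."""
    if noun["gender"] == "masculine":
        return {"form": noun["form"] + TA_MARBUTA, "gender": "feminine"}
    return noun


def attach_suffix_pronoun(noun, person, number):
    """Attach the first person singular suffix pronoun to the noun."""
    if person == "first" and number == "singular":
        base = noun["form"]
        if base.endswith(TA_MARBUTA):
            # ة opens to ت before a suffix (as in زوجة -> زوجتي).
            base = base[:-1] + "ت"
        return {**noun, "form": base + "ي"}
    raise NotImplementedError("only the first person singular rule is sketched")


husband = {"form": "زوج", "gender": "masculine"}
wife = make_feminine(husband)
my_wife = attach_suffix_pronoun(wife, "first", "singular")
```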
Synthesis of inflected Verbs
(very complex: rich in form and meaning)
conjugate a verb with tense
conjugate a verb with number
conjugate a verb with prefix pronoun
conjugate a verb with suffix pronoun
….
Rule: synthesize first person plural of assimilated verbs
Input: first person singular past verb
Output: inflected verb
Example: وصلنا - سنصل - نصل
If verb.tense = future
then remove first weak & attach_prefix("سن")
else if verb.tense = present
then remove first weak & attach_prefix("ن")
else attach_suffix(verb.stem,"نا")
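A direct Python rendering of this rule, given as a sketch: the function name and the check on the first letter are assumptions, and the stem is taken to be the bare past form (here وصل).

```python
WEAK_LETTERS = "وي"  # possible weak first radicals of an assimilated verb


def first_person_plural(stem, tense):
    """Synthesize the first person plural form of an assimilated verb."""
    if stem[0] not in WEAK_LETTERS:
        raise ValueError("expected an assimilated verb (weak first radical)")
    if tense == "future":
        return "سن" + stem[1:]   # remove first weak letter, attach prefix
    if tense == "present":
        return "ن" + stem[1:]    # remove first weak letter, attach prefix
    return stem + "نا"           # past: keep the stem, attach suffix


forms = [first_person_plural("وصل", t) for t in ("past", "future", "present")]
```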
Issues in the morphological generation
MultiWord/compound Expressions
Out-of-Vocabulary (OOV)
Some forms need special handling:
• Substitution: This man – هذا الرجل
• Literal numbers (complex nouns)
Arabic script:
• ‘ل’ + ‘ال’ → ‘للـ’
• “ي” + “زملاء” → ‘زملاءي’ → ‘زملائي’
• “غرفة” → “غرفتان”
Rule-based Arabic Syntactic
Analyzer
Types of Rules

Grammatical rules:


Describe sentence and phrase structures,
and ensure the agreement relations
between various elements in the sentence.
Parsing

Accepts the input and generates the
sentence structure (parse tree)
Parsing of the sentence “الطالبة مجتهدة”
The student (sg,f) is diligent (sg,f)
Parse tree (bottom-up):
الطالبة: noun (definite, fem, sg) → definite (definite, fem, sg) → inchoative (definite, fem, sg)
مجتهدة: noun (indefinite, fem, sg) → enunciative (indefinite, fem, sg)
inchoative + enunciative → nominal sentence
Agreement:
•Number
•Gender
Nominal sentence -> definite_Inchoative(Number,Gender) indefinite_enunciative(Number,Gender)
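The nominal-sentence rule with its agreement constraints can be sketched in Python. The lexicon entries stand in for the output of a morphological analyzer and are assumptions, as are the function and key names.

```python
# Hypothetical pre-analysed words (what a morphological analyzer would return).
LEXICON = {
    "الطالبة": {"definite": True, "gender": "fem", "number": "sg"},
    "مجتهدة": {"definite": False, "gender": "fem", "number": "sg"},
    "مجتهد": {"definite": False, "gender": "masc", "number": "sg"},
}


def parse_nominal_sentence(words):
    """Accept 'definite inchoative + indefinite enunciative' if they agree."""
    if len(words) != 2 or any(w not in LEXICON for w in words):
        return None
    inchoative, enunciative = (LEXICON[w] for w in words)
    agrees = all(
        inchoative[feature] == enunciative[feature]
        for feature in ("number", "gender")
    )
    if inchoative["definite"] and not enunciative["definite"] and agrees:
        return {
            "type": "nominal sentence",
            "inchoative": words[0],
            "enunciative": words[1],
            "number": inchoative["number"],
            "gender": inchoative["gender"],
        }
    return None


tree = parse_nominal_sentence(["الطالبة", "مجتهدة"])
rejected = parse_nominal_sentence(["الطالبة", "مجتهد"])
```

The second call fails on gender agreement, which is the kind of mismatch a grammar checker would report.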
Issues in the syntactic analysis
Ambiguity (more than one parse tree)
• Disambiguation techniques
Handling ill-formed input
• Detection (grammar checking)
• Recovery (partial parsing: the partial parses are chunks to be related)
Rule-based Arabic Syntactic
Generator
Types of Rules



Determine phrase structures
Determine syntactic structure
Ensure the agreement relations
between various elements in the
sentence.
Rule: verb-subject agreement
Input: verb and inflected subject (a preverbal NP )
Output: inflected verb agreed with its
inflected subject
synthesize_verb(Subject.number,verb.stem)
synthesize_verb(Subject.gender,verb.stem)
An agreement example:
الأولاد زاروا خمس متاحف قديمة
the-boys visited-they five museum old
The boys visited five old museums
Agreement relations:
• verb-Subject (N,G): زاروا and الأولاد
• counted-Num (G): خمس and متاحف
• Adj-noun (G): قديمة and متاحف
Issues in the syntactic generation
Word order (VSO, SVO, etc.)
Agreement (full/partial)
Dropping the subject pronoun (called pro-drop), i.e., having a null subject when the inflected verb includes subject affixes.
Syntax that captures the source/intended meaning:
• My son is 8 = ابني عمره ثماني سنوات
• I did not understand the last sentence = أنا لم أفهم الجملة الأخيرة
Rule-based Arabic NLP applications
Named Entity Recognition
Machine translation
Transferring Egyptian Colloquial Dialect into Modern Standard Arabic
What is entity recognition?


Identifying, extracting, and normalizing
entities from documents such as names
of people, locations, or companies.
Makes unstructured data more
structured
Politics of Ukraine
In July 1994, Leonid Kuchma was elected as Ukraine's second president in free and fair
elections. Kuchma was reelected in November 1999 to another five-year term, with 56
percent of the vote. International observers criticized aspects of the election, especially
slanted media coverage; however, the outcome of the vote was not called into question. In
March 2002, Ukraine held its most recent parliamentary elections, which were
characterized by the Organization for Security and Cooperation in Europe (OSCE) as
flawed, but an improvement over the 1998 elections. The pro-presidential For a United
Ukraine bloc won the largest number of seats, followed by the reformist Our Ukraine bloc
of former Prime Minister Viktor Yushchenko, and the Communist Party. There are 450
seats in parliament, with half chosen from party lists by proportional vote and half from
individual constituencies.
Entity Extractor → Person, Date, Location
Person Entity Recognition (1)
Example: ‘الملك الأردني عبد الله الثاني’
The Jordanian king Abdullah II

We want to have a rule that recognizes a
person name composed of a first name
followed by optional last names, based on
a preceding person indicator pattern.
Person Entity Recognition (2)
The Rule component of this example:
Name Entity: عبد الله [Abdullah]
Indicator pattern:
• an honorific such as "الملك" [The king]
• Nasab: (optional) inflected from a location name, such as "الأردني" [Jordanian].
The rule also matches an optional ordinal number appearing at the end of some names such as "الثاني" [II].
Person Entity Recognition (3)
((honorific+(location(ي|ية))?)+first_Name(last_Name)?+(number)?)
This (Regular Expression) rule can recognize:
• الملك عبد الله
• الملك الأردني عبد الله
• الملك الأردني عبد الله الثاني
• الملكة الأردنية رانيا
• …
Issues in the Arabic NER




Complex Morphological System
(inflections)
Non-casing language (No initial capital
for proper nouns)
Non-standardization and inconsistency
in Arabic written text (typos, and
spelling variants)
Ambiguity
Machine Translation



Direct
Transfer
Interlingua
MT Approaches
[Figure: MT Pyramid. Analysis side: Source word, Source syntax. Generation side: Target syntax, Target word. Levels from base to apex: Direct, Transfer, Interlingua.]
English-to-Arabic Transfer-based Approach
Source sentence (English)
→ Sentence Analysis: morphological & syntactic analysis (English Dic., Rules of English)
→ English Parse Tree
→ Transfer (Bi-ling Dic., English-to-Arabic Transformation Rules)
→ Arabic Parse Tree
→ Sentence Synthesis (Arabic Dic., Morphological Gen. & Synthesis Rules of Arabic)
→ Target sentence (Arabic)
Transfer approach
Involves analysis, transfer, and generation components.
If you have an Arabic parser and an Arabic syntactic generator, all you need is to acquire the transfer rules and build the transfer component.
Simple Transfer (1)
[w_i:$1, w_i+1:$2, …, w_k:$k] → [w_k:$k, w_k-1:$k-1, …, w_i:$1]   (1 ≤ i ≤ k)
Networks performance evaluation → تقييم أداء شبكة
[Figure: transfer of parse trees. English np: noun "networks" (pl), noun "performance" (sg), noun "evaluation" (sg). Arabic np: noun "تقييم" (sg), noun "أداء" (sg), noun "شبكة" (pl).]
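The reversal rule can be sketched in Python as below. The bilingual dictionary holds only the three lemmas of the example, each node is a (lemma, number) pair, and the function name is hypothetical. Producing the inflected Arabic surface forms is left to a morphological generator.

```python
# Toy bilingual dictionary keyed by English lemma (illustrative only).
BILINGUAL_DICT = {
    "network": "شبكة",
    "performance": "أداء",
    "evaluation": "تقييم",
}


def transfer_noun_sequence(source_nodes):
    """Reverse a sequence of English nouns and translate each lemma.

    Number features are carried over unchanged to the Arabic nodes.
    """
    target_nodes = []
    for lemma, number in reversed(source_nodes):
        target_nodes.append((BILINGUAL_DICT[lemma], number))
    return target_nodes


english_np = [("network", "pl"), ("performance", "sg"), ("evaluation", "sg")]
arabic_np = transfer_noun_sequence(english_np)
```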
Issues in the Transfer-based MT approach
Synonyms of a word
• Acquisition → “اكتساب” or “استخلاص”.
Agreement
• intelligent tutoring systems → “نظم التعليم الذكية” or “نظم التعليم الذكي”
Problems with prepositions
• did you do fungal analysis? → “هل قمت بـتحليل الفطر؟”
…
Interlingua MT – Multilingual translation
Interlingua = Semantic Representation
Deep analysis (no need for a transfer component)
Only analysis and generation components
Add an Arabic analyzer to translate to other languages
Add an Arabic generator to translate from other languages
Analysis of Arabic to Interlingua
Input: العميل: أنا أرغب في حجز غرفة في الفندق [Client: I would like to reserve a room in the hotel]
→ Preprocessor
→ Sentence Analyzer (Arabic Lexicon, Arabic Grammar Rules), which calls the Morphological Analyzer (Arabic Morphology Rules)
→ Parse Tree
→ Mapper (Map Lexicon, Ontology)
→ Interlingua (IF):
c:introduce-topic+reservation+disposition+room (room-spec=(room, specifier=hote,identifiability=yes),disposition=(desire,who=i))
Generating Arabic from Interlingua
Input: Interlingua (IF):
c:introduce-topic+reservation+disposition+room (room-spec=(room, specifier=hote,identifiability=yes),disposition=(desire,who=i))
→ Mapper (Map Lexicon, Ontology, Map Rules)
→ Feature Structure
→ Sentence Generator (Arabic Lexicon, Arabic Grammar Rules), which calls the Morphological Generator (Arabic Morphology Rules)
→ Output: العميل: أنا أرغب في حجز غرفة في الفندق [Client: I would like to reserve a room in the hotel]
Issues in the interlingua
approach

Interlingua:



language-neutral representation
captures the intended meaning of the
source sentence
Requires a fully-disambiguating parser
Transferring Egyptian Colloquial
Dialect into Modern Standard Arabic



Be able to reuse MSA processing tools
with colloquial Arabic by transferring
colloquial Arabic words into their
corresponding MSA words.
Facilitate the communication with
colloquial Arabic speakers
Restore the Arabic dialect to the
standard language in use nowadays.
A one-to-one transfer example
امتي؟ → (Mapping) → متي؟ (when?)
A one-to-many transfer example
عال (on-the) → (Mapping) → علي (on) + ال (the)
A complete sentence example
جيت امتي؟ (You-came when?) → (Mapping) → جئت متي؟ → (reordering) → متي جئت؟ (When did-you-come?)
•Step (1)
• جيت → جئت
• امتي → متي
•Step (2)
• the New Segment Position for the word “امتى” is start of sentence (SoS)
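The mapping and reordering steps can be sketched in Python. The table below contains only the three entries shown on the slides; its layout (a list of MSA words plus an optional position tag) and the function name are assumptions for illustration.

```python
# Colloquial word -> (MSA words, new segment position or None).
# "SoS" marks a word that must move to the start of the sentence.
TRANSFER_TABLE = {
    "جيت": (["جئت"], None),
    "امتي": (["متي"], "SoS"),
    "عال": (["علي", "ال"], None),  # one-to-many mapping
}


def to_msa(tokens):
    """Step 1: map each colloquial token; step 2: reorder SoS words."""
    fronted, rest = [], []
    for token in tokens:
        msa_words, position = TRANSFER_TABLE.get(token, ([token], None))
        if position == "SoS":
            fronted.extend(msa_words)
        else:
            rest.extend(msa_words)
    return fronted + rest


sentence = to_msa(["جيت", "امتي"])
split_word = to_msa(["عال"])
```

Tokens that are not in the table pass through unchanged, which reflects the assumption that they are already valid MSA words.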
Issues in the transfer to MSA

More investigations are needed
Arabic NLP Free Resources
Arabic Morphological Analyzers
Tim Buckwalter Morphological Analyzer
http://www.qamus.org/
http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2002L49
Xerox
http://www.cis.upenn.edu/~cis639/arabic/input/keyboard_input.html
Aramorph
http://www.nongnu.org/aramorph/english/index.html
Arabic spell checker
Aspell
http://aspell.net/
http://www.freshports.org/arabic/aspell
Arabic Morphological Generation
Sarf
http://sourceforge.net/projects/sarf
Tokenization & POS tagging
ArabicSVMTools: The tools utilize the Yamcha SVM tools to tokenize, POS tag and Base Phrase Chunk Arabic text
http://www1.cs.columbia.edu/~mdiab/
http://www1.cs.columbia.edu/~mdiab/software/AMIRA-1.0.tar.gz
MADA: a full morphological tagger for Modern Standard Arabic.
http://www1.cs.columbia.edu/~rambow/software-downloads/MADA_Distribution.html
POS tagging
Stanford Log-linear Part-Of-Speech Tagger
http://nlp.stanford.edu/software/tagger.shtml
http://nlp.stanford.edu/software/stanford-arabic-tagger-2008-09-28.tar.gz
Tokenization & POS tagging
Attia's Finite State Tools for Modern Standard Arabic
http://www.attiaspace.com/getrec.asp?rec=htmFiles/fsttools
Arabic Parsers
Dan Bikel’s Parser
http://www.cis.upenn.edu/~dbikel/
http://www.cis.upenn.edu/~dbikel/software.html
Attia Arabic Parser
http://www.attiaspace.com/
http://decentius.aksis.uib.no/logon/xle.xml
Arabic wordnet
Arabic WordNet
http://www.globalwordnet.org/AWN/
http://personalpages.manchester.ac.uk/staff/paul.thompson/AWNBrowser.zip
Translation resources
Tools: GIZA++, MOSES, Pharaoh, Rewrite and BLEU
http://www.statmt.org/
APIs:
http://code.google.com/apis/ajax/playground/#translate
http://code.google.com/apis/ajax/playground/#batch_translate
Transliterate
http://code.google.com/apis/ajax/playground/#transliterate_arabic
Mailing Lists – just to be
connected to the NLP community

[email protected]


[email protected]


http://www.linguistlist.org/
[email protected]


http://mailman.uib.no/listinfo/corpora
http://www.semitic.tk/
[email protected]

http://www.arabicscript.org/CAASL3/index.html
Conclusion (1)
Arabic requires the treatment of the language constituents at all levels: morphology, syntax, and semantics.
Most research in Arabic NLP concentrates on the analysis part, aiming at automated understanding of the Arabic language.
Conclusion (2)
Arabic NLP in general is significantly underdeveloped.
In order to bridge this gap and help Arabic NLP research catch up with the many recent advances of Latin languages, we need collaborative efforts from the Arabic research community.
Conclusion (3)
We need public-domain resources (in electronic form):
• Linguistic resources such as large Arabic (bilingual) corpora and treebanks
• Machine-readable (bilingual) dictionaries
• Morphological analyzers
• Parsers
• …
Conclusion (4)
We need to secure funding for:
• Exchanging visits (experience, Expert Network)
• Buying software
• Securing dedicated RAs and/or PhD students for the NLP task
Thank you!
Merci!
Shukran!
‫شكرا‬