Rule-based approach in Arabic
NLP: Tools, Systems and
Resources
Dr Khaled Shaalan
Professor, Faculty of Computers &
Information, Cairo University
On Secondment to BUiD, UAE
Khaled.shaalan@{buid.ac.ae,
gmail.com}
CITALA2009 - Morocco
Agenda
Objective
Language Tasks
NLP Approaches
Rule-based Arabic Analysis and generation
tools
Rule-based Arabic NLP applications
Some Arabic NLP Free Resources
Major and Arabic mailing lists
Conclusion
Objective
To show how the rule-based approach has
been successfully used to develop Arabic
natural language processing tools and
applications.
Separating Language Tasks
English vs. French vs. Arabic vs . . .
Spoken language (dialogue) vs written text vs
handwritten script
Genuine Script vs transliterated (Romanized)
script
Vocalized (vowelized) vs non-vocalized
Understanding vs. generation
First language learner vs second language
learner
Classical or Qur’anic Arabic vs Modern Standard
Arabic vs colloquial (dialects)
Stem-based vs root-based
Rules
Situation/Action
If match(stem.prefix, def_article)
then remove(stem.prefix,Stem_FS)
If match(stem.definiteness,indefinite)
then morph_gen(stem.definiteness,Stem_FS)
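The situation/action form of the first rule above can be sketched in Python as a condition paired with an action over a feature structure. This is only an illustration: the dictionary layout, the `Rule` class, the helper names, and the sample stem "كتاب" are assumptions, not part of the original system.

```python
from dataclasses import dataclass
from typing import Callable

DEF_ARTICLE = "ال"  # Arabic definite article


@dataclass
class Rule:
    situation: Callable[[dict], bool]  # when does the rule apply?
    action: Callable[[dict], dict]     # what does it do to the feature structure?


def remove_prefix(fs: dict) -> dict:
    # Return a copy with the prefix cleared and the definiteness recorded.
    return {**fs, "prefix": "", "definiteness": "definite"}


# Hypothetical rule set holding only the definite-article rule.
RULES = [
    Rule(lambda fs: fs.get("prefix") == DEF_ARTICLE, remove_prefix),
]


def apply_rules(fs: dict) -> dict:
    # Each rule is tried once, in order; no recognize-act loop is involved.
    for rule in RULES:
        if rule.situation(fs):
            fs = rule.action(fs)
    return fs


result = apply_rules({"prefix": "ال", "stem": "كتاب"})
unchanged = apply_rules({"prefix": "", "stem": "كتاب"})
```

Keeping the situation and the action as separate callables makes each rule easy to read and test on its own.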
Common Mistake
The rule-based approach is not a rule-based
expert system!
Both consist of rules.
A rule-based expert system solves the
problem through a Recognize-Act Cycle
Loop
Conflict resolution strategy
Recognize-Act Cycle
Domain Knowledge: Rule Base
Working Memory: Fact Base
loop
1. Match: Rules are compared to working memory to determine matches. If no rule matches, then stop.
2. Conflict Resolution: Select or enable a single rule for execution.
3. Execute: Fire the selected rule.
• Add new fact, or
• Learn a new rule
end loop
[Figure: the recognize-act cycle. Match (n rules) -> Conflict Resolution (1 rule) -> Execute, which produces a New Fact for the Fact Base or a New Rule for the Rule Base.]
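The three steps of the recognize-act cycle can be sketched as a small forward-chaining loop. This is a minimal illustration: the rule format, the fact names, and the "first match wins" conflict-resolution strategy are assumptions chosen for brevity, and learning new rules is left out.

```python
def recognize_act(rules, facts):
    """Run match / conflict resolution / execute until no rule matches."""
    facts = set(facts)
    while True:
        # 1. Match: rules whose conditions hold and whose conclusion is new.
        matches = [r for r in rules
                   if r["if"] <= facts and r["then"] not in facts]
        if not matches:
            return facts
        # 2. Conflict resolution: pick the first match (one simple strategy).
        selected = matches[0]
        # 3. Execute: fire the rule, adding a new fact to working memory.
        facts.add(selected["then"])


# Hypothetical rule base and initial working memory.
RULES = [
    {"if": {"has_definite_article"}, "then": "is_definite"},
    {"if": {"is_definite", "is_noun"}, "then": "can_be_inchoative"},
]
final_facts = recognize_act(RULES, {"has_definite_article", "is_noun"})
```

The loop ends because every iteration adds a fact that was not present before, and the rule base is finite.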
NLP Approaches
Rule-based
Statistical-based
NLP Approaches (1)
Rule-based:
• Relies on hand-constructed rules that are to be acquired from language specialists
• Requires only a small amount of training data
• Development could be very time consuming
Statistical-based:
• Developers do not need language specialists' expertise
• Requires a large amount of annotated training data (very large corpora)
• Automated
NLP Approaches (2)
Rule-based:
• Some changes may be hard to accommodate
• Not easy to obtain high coverage of the linguistic knowledge
• Useful for limited domains
• Can be used with both well-formed and ill-formed input
• High quality based on solid linguistic
Statistical-based:
• Some changes may require re-annotation of the entire training corpus
• Coverage depends on the training data
• Not easy to work with ill-formed input, as both well-formed and ill-formed input are still probable
• Lower quality: does not explicitly deal with syntax
Rule-based Arabic NLP tools
Morphological Analyzers
Morphological Generators
Syntactic Analyzers
Syntactic Generators
Rule-based Arabic
Morphological Analyzer
Morphological Analysis
Break down the inflected Arabic word into a root/stem,
affixes, and features.
Example: sa-‘uEty-kumA (سأعطيكما) ‘will I give you…’
سـ: sa-
TYPE: Particle
INFLECTION: ‘Future’
أعطيـ: -‘uEty-
TYPE: VERB
ASPECT: IMPERF
MOOD: IND
PERS: 1
GENDER: M/F
NUMBER: SG
SUBJ: I
كما: -kumA
TYPE: AFFPR
GENDER: M/F
NUMBER: DUAL
GF: OBJ
Rules - Augmented Transition
Network (ATN) technique
Rules associated with arcs represent the
context-sensitive knowledge about the
relation between a root and inflections.
More than one rule may be associated with
one arc.
Conditions associated with the arcs are
placed in such a way that the arc to be
traversed first is the one that leads to the
most probable solution.
Arabic Morphology using ATN Technique
Types of Rules
Remove Prefix or Suffix
Remove doubled letter
Add/change Hamza, Weak letter,…
…
Analysis of the verb "شاهدتك" (I saw you): Remove suffixes
[Figure: ATN with states S0, S1, S2, S3, S10 and arc tests last1 = "ك" and last2 = "ت"; the word is reduced شاهدتك -> شاهدت -> شاهد.]
• stem: "شاهد" (saw)
• perfect
• 1st person sg pronoun: "ت"
• 2nd person sg pronoun: "ك"
Analysis of the verb "يلعبون" (they are playing): Remove prefix & suffix
[Figure: ATN with states S0, S1, S2, S3, S10 and arc tests Begin2 = "ي" and last2 = "ون"; the word is reduced يلعبون -> لعبون -> لعب.]
• stem: "لعب" (played)
• imperfect
• Plural subject
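The two analyses above can be approximated by a small search that tries each affix arc in a fixed order and accepts a path only when it ends in a known stem. This is a simplified sketch, not the ATN implementation itself: the affix tables, the two-word stem lexicon, and the arc ordering are assumptions made for illustration.

```python
# Hypothetical lexicon and affix tables covering only the two slide examples.
STEMS = {"شاهد", "لعب"}
PREFIX_ARCS = [("ي", "imperfect"), ("", "perfect")]
OBJECT_ARCS = [("ك", "2nd person sg pronoun"), ("", None)]
SUBJECT_ARCS = [("ون", "plural subject"), ("ت", "1st person sg pronoun"), ("", None)]


def analyze(word):
    """Return (stem, features) for the first path that ends in a known stem."""
    for prefix, p_feat in PREFIX_ARCS:
        if not word.startswith(prefix):
            continue
        rest = word[len(prefix):]
        for obj, o_feat in OBJECT_ARCS:
            if not rest.endswith(obj):
                continue
            core = rest[:len(rest) - len(obj)]
            for subj, s_feat in SUBJECT_ARCS:
                if not core.endswith(subj):
                    continue
                stem = core[:len(core) - len(subj)]
                if stem in STEMS:
                    # Keep only the features that were actually found.
                    return stem, [f for f in (p_feat, s_feat, o_feat) if f]
    return None  # no path reaches a known stem


saw_you = analyze("شاهدتك")
they_play = analyze("يلعبون")
```

Listing the non-empty affixes before the empty one mirrors the idea of traversing the most probable arc first, and the final lexicon check keeps the search from accepting arbitrary leftovers as stems.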
Issues in the morphological
analysis
Overgeneration (too many outputs)
Ambiguity
Reconstruction of vowels
MultiWord/compound Expressions
Out-of-Vocabulary (OOV)
Handling ill-formed input
Detection (spell checking)
Correction: relaxation, e.g. "ه" instead of "ة"
Prevent ill-formed output
Check the compatibility (the prefix "ف" cannot come after the prefix "ب" or "ك").
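The relaxation and compatibility checks above can be sketched as two small functions. The one-word lexicon, the function names, and the representation of prefixes as a list are assumptions for illustration only.

```python
LEXICON = {"مدرسة"}  # hypothetical one-word lexicon ("school")


def relax_final_ha(word, lexicon=LEXICON):
    """Accept a final 'ه' typed in place of 'ة' when that yields a known word."""
    if word in lexicon:
        return word
    if word.endswith("ه"):
        candidate = word[:-1] + "ة"
        if candidate in lexicon:
            return candidate
    return None  # not recoverable by this relaxation


def prefixes_compatible(prefixes):
    """Reject prefix sequences in which 'ف' follows 'ب' or 'ك'."""
    for i, prefix in enumerate(prefixes):
        if prefix == "ف" and any(p in {"ب", "ك"} for p in prefixes[:i]):
            return False
    return True


corrected = relax_final_ha("مدرسه")
```

The first function handles ill-formed input on the analysis side; the second is the kind of check a generator can run before emitting a word.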
Rule-based Arabic
Morphological Generator
Morphological generation
Synthesis of an inflected Arabic word from a
given root/stem according to a combination
of morphological properties that include:
definiteness (definite article "ال"),
gender (masculine, feminine),
number (singular, dual, plural),
case (nominative, genitive, accusative,…),
person (first, second, third)
…
Types of Rules
synthesis of inflected
Noun
Verb
particle
Synthesis of inflected Nouns
definite noun
feminine noun
pluralize noun
dual noun
attach a prefix preposition
attach a suffix pronoun
end case
….
Synthesis of feminine noun
If noun.gender = masculine
Then attach suffix feminine letter
Example:
"زوج" (husband) -> "زوجة" (wife)
Synthesis of suffix pronoun
If pronoun.person = first and
pronoun.number = singular
Then attach first person singular suffix
pronoun
Example:
"زوجة" (wife) -> "زوجتي" (my wife)
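The two noun synthesis rules above (feminine noun, suffix pronoun) can be sketched as functions over a small feature structure. The dictionary fields, the function names, and the one-entry pronoun table are assumptions; the change of "ة" to "ت" before a suffix follows the slide's own example.

```python
# Hypothetical table of suffix pronouns, keyed by (person, number).
SUFFIX_PRONOUNS = {("first", "singular"): "ي"}


def make_feminine(noun):
    """If the noun is masculine, attach the feminine letter."""
    if noun["gender"] == "masculine":
        return {"form": noun["form"] + "ة", "gender": "feminine"}
    return noun


def attach_suffix_pronoun(noun, pronoun):
    """Attach the suffix pronoun selected by the pronoun's person and number."""
    suffix = SUFFIX_PRONOUNS[(pronoun["person"], pronoun["number"])]
    form = noun["form"]
    if form.endswith("ة"):
        # The final feminine letter is written as an open "ت" before a suffix.
        form = form[:-1] + "ت"
    return {**noun, "form": form + suffix}


husband = {"form": "زوج", "gender": "masculine"}
wife = make_feminine(husband)
my_wife = attach_suffix_pronoun(wife, {"person": "first", "number": "singular"})
```

Chaining the two calls reproduces the slide's sequence from "زوج" to "زوجة" to "زوجتي".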
Synthesis of inflected Verbs
(very complex: rich in form and meaning)
conjugate a verb with tense
conjugate a verb with number
conjugate a verb with prefix pronoun
conjugate a verb with suffix pronoun
….
Rule: synthesize first person plural of
assimilated verbs
Input: first person singular past verb
Output: inflected verb
Example: وصلنا - سنصل - نصل
If verb.tense = future
then remove first weak & attach_prefix("سن")
else if verb.tense = present
then remove first weak & attach_prefix("ن")
else attach_suffix(verb.stem,"نا")
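The rule above translates almost directly into Python. This is a sketch under assumptions: the set of weak letters and the function name are mine, and the stem is taken to be the bare past form, as in the slide's example "وصل".

```python
WEAK_LETTERS = {"و", "ي", "ا"}  # assumed set of weak letters


def first_person_plural(stem, tense):
    """Synthesize the first person plural of an assimilated verb."""
    if tense in ("future", "present"):
        # Remove the initial weak letter, then attach the prefix.
        body = stem[1:] if stem[0] in WEAK_LETTERS else stem
        prefix = "سن" if tense == "future" else "ن"
        return prefix + body
    # Past tense: keep the stem and attach the suffix.
    return stem + "نا"


future = first_person_plural("وصل", "future")
present = first_person_plural("وصل", "present")
past = first_person_plural("وصل", "past")
```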
Issues in the morphological
generation
MultiWord/compound Expressions
Out-of-Vocabulary (OOV)
Some forms need special handling:
Substitution: This man – هذا الرجل
literal numbers (complex nouns)
Arabic script
"ل" + "ال" -> "للـ"
"زملاء" + "ي" -> "زملاءي" -> "زملائي"
"غرفة" -> "غرفتان"
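The three script adjustments listed above can be sketched as small string rewrites. These functions are illustrative simplifications that cover only the slide's examples; the function names and the sample noun "الرجل" (taken from the substitution example earlier on this slide) are assumptions about how such rules might be packaged.

```python
def attach_preposition_li(noun):
    """Attach the prefix preposition "ل"; with the article, "ل" + "ال" gives "لل"."""
    if noun.startswith("ال"):
        return "لل" + noun[2:]
    return "ل" + noun


def attach_ya(noun):
    """Attach the suffix "ي"; a final "ء" is rewritten as "ئ" before it."""
    if noun.endswith("ء"):
        return noun[:-1] + "ئي"
    return noun + "ي"


def make_dual(noun):
    """Form the nominative dual; a final "ة" opens to "ت" before the ending."""
    if noun.endswith("ة"):
        return noun[:-1] + "تان"
    return noun + "ان"


for_the_man = attach_preposition_li("الرجل")
my_colleagues = attach_ya("زملاء")
two_rooms = make_dual("غرفة")
```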
Rule-based Arabic Syntactic
Analyzer
Types of Rules
Grammatical rules:
Describe sentence and phrase structures,
and ensure the agreement relations
between various elements in the sentence.
Parsing
Accepts the input and generates the
sentence structure (parse tree)
Parsing of the sentence "الطالبة مجتهدة"
The student (sg,f) is diligent (sg,f)
nominal sentence
  Inchoative (definite, fem, sg)
    definite (definite, fem, sg)
      noun (definite, fem, sg): الطالبة
  enunciative (indefinite, fem, sg)
    noun (indefinite, fem, sg): مجتهدة
Agreement:
•Number
•Gender
Nominal sentence -> definite_Inchoative(Number,Gender) indefinite_enunciative(Number,Gender)
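The nominal-sentence rule with its agreement constraints can be sketched as a function that accepts two analyzed words and returns a parse only when definiteness, number, and gender line up. The dictionary fields and the returned structure are assumptions for illustration; "مجتهد" is used as a masculine counterexample.

```python
def parse_nominal_sentence(first, second):
    """Nominal sentence -> definite inchoative + indefinite enunciative."""
    if first["pos"] != "noun" or second["pos"] != "noun":
        return None
    if first["definiteness"] != "definite":
        return None
    if second["definiteness"] != "indefinite":
        return None
    # Agreement: both constituents must share number and gender.
    if (first["number"], first["gender"]) != (second["number"], second["gender"]):
        return None
    return {
        "type": "nominal sentence",
        "inchoative": first["form"],
        "enunciative": second["form"],
    }


student = {"form": "الطالبة", "pos": "noun", "definiteness": "definite",
           "gender": "fem", "number": "sg"}
diligent_f = {"form": "مجتهدة", "pos": "noun", "definiteness": "indefinite",
              "gender": "fem", "number": "sg"}
diligent_m = {**diligent_f, "form": "مجتهد", "gender": "masc"}

tree = parse_nominal_sentence(student, diligent_f)
rejected = parse_nominal_sentence(student, diligent_m)
```

A failed agreement check returns `None`, which is the hook a grammar checker could use to report the mismatch.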
Issues in the syntactic analysis
Ambiguity (more than one parse tree)
Disambiguation techniques
Handling ill-formed input
Detection (grammar checking)
Recovering (Partial parsing - parses =
chunks to be related)
Rule-based Arabic Syntactic
Generator
Types of Rules
Determine phrase structures
Determine syntactic structure
Ensure the agreement relations
between various elements in the
sentence.
Rule: verb-subject agreement
Input: verb and inflected subject (a preverbal NP )
Output: inflected verb that agrees with its
inflected subject
synthesize_verb(Subject.number,verb.stem)
synthesize_verb(Subject.gender,verb.stem)
An agreement example:
الأولاد زاروا خمس متاحف قديمة
the-boys visited-they five museum old
The boys visited five old museums
Agreement relations (N = number, G = gender):
قديمة / متاحف: Adj-noun (G)
متاحف / خمس: counted-Num (G)
زاروا / الأولاد: verb-Subject (N,G)
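The verb-subject relation in this example can be sketched by selecting a past-tense suffix from the subject's number and gender. This is a partial illustration: the suffix table covers only combinations that leave the stem unchanged, and the single-function signature is an assumption rather than the original two-call interface.

```python
# Past-tense subject suffixes keyed by (number, gender); forms that would
# require a stem change are deliberately left out of this sketch.
PAST_SUFFIX = {
    ("sg", "m"): "",
    ("sg", "f"): "ت",
    ("dual", "m"): "ا",
    ("dual", "f"): "تا",
    ("pl", "m"): "وا",
}


def synthesize_verb(subject, verb_stem):
    """Inflect a past-tense verb to agree with a preverbal subject."""
    key = (subject["number"], subject["gender"])
    if key not in PAST_SUFFIX:
        raise ValueError(f"agreement pattern {key} is not covered by this sketch")
    return verb_stem + PAST_SUFFIX[key]


boys = {"form": "الأولاد", "number": "pl", "gender": "m"}
visited = synthesize_verb(boys, "زار")
```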
Issues in the syntactic
generation
Word order (VSO,SVO, etc.)
Agreement (full/partial)
dropping the subject pronoun (called Pro-drop),
i.e., to have a null subject, when the inflected
verb includes subject affixes.
Syntax that captures the source/intended meaning
My son is 8 = أبني عمره ثماني سنوات
I did not understand the last sentence = أنا لم أفهم الجملة الأخيرة
Rule-based Arabic NLP
applications
Named Entity Recognition
Machine translation
Transferring Egyptian Colloquial Dialect
into Modern Standard Arabic
What is entity recognition?
Identifying, extracting, and normalizing
entities from documents such as names
of people, locations, or companies.
Makes unstructured data more
structured
Politics of Ukraine
In July 1994, Leonid Kuchma was elected as Ukraine's second president in free and fair
elections. Kuchma was reelected in November 1999 to another five-year term, with 56
percent of the vote. International observers criticized aspects of the election, especially
slanted media coverage; however, the outcome of the vote was not called into question. In
March 2002, Ukraine held its most recent parliamentary elections, which were
characterized by the Organization for Security and Cooperation in Europe (OSCE) as
flawed, but an improvement over the 1998 elections. The pro-presidential For a United
Ukraine bloc won the largest number of seats, followed by the reformist Our Ukraine bloc
of former Prime Minister Viktor Yushchenko, and the Communist Party. There are 450
seats in parliament, with half chosen from party lists by proportional vote and half from
individual constituencies.
Entity Extractor
Person
Date
Location
Person Entity Recognition (1)
Example: "الملك الأردني عبد الله الثاني"
The Jordanian king Abdullah II
We want to have a rule that recognizes a
person name composed of a first name
followed by optional last names, based on
a preceding person indicator pattern.
Person Entity Recognition (2)
The Rule component of this example:
Name Entity: "عبد الله" [Abdullah]
indicator pattern:
an honorific such as "الملك" [The king]
Nasab: (optional) inflected from a location name, e.g.
"الأردني" [Jordanian].
The rule also matches an optional ordinal
number appearing at the end of some names such
as "الثاني" [II].
Person Entity Recognition (3)
((honorific+(location(ي|ية))?)+
first_Name(last_Name)?+(number)?)
This (Regular Expression) rule can recognize:
الملك عبد الله
الملك الأردني عبد الله
الملك الأردني عبد الله الثاني
الملكة الأردنية رانيا
…
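The person rule above can be approximated with Python's `re` module by building each slot from a small gazetteer. This is a sketch, not the NERA implementation: the word lists, the group names, and the last-name entry "الحسين" are hypothetical and exist only to make the pattern runnable.

```python
import re

# Hypothetical gazetteers; a real system would load much larger lists.
HONORIFICS = ["الملك", "الملكة"]
LOCATIONS = ["الأردن"]
FIRST_NAMES = ["عبد الله", "رانيا"]
LAST_NAMES = ["الحسين"]
ORDINALS = ["الثاني"]


def alt(items):
    """Build a non-capturing alternation, longest entry first."""
    ordered = sorted(items, key=len, reverse=True)
    return "(?:" + "|".join(re.escape(item) for item in ordered) + ")"


PERSON = re.compile(
    rf"(?P<honorific>{alt(HONORIFICS)})"
    rf"(?: (?P<nasab>{alt(LOCATIONS)}(?:ية|ي)))?"
    rf" (?P<name>{alt(FIRST_NAMES)}(?: {alt(LAST_NAMES)})?)"
    rf"(?: (?P<number>{alt(ORDINALS)}))?"
)

king = PERSON.search("الملك الأردني عبد الله الثاني")
queen = PERSON.search("الملكة الأردنية رانيا")
plain = PERSON.search("الملك عبد الله")
```

Named groups keep the indicator pattern (honorific, nasab) separate from the name itself, so only the `name` group needs to be reported as the entity.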
Issues in the Arabic NER
Complex Morphological System
(inflections)
Non-casing language (No initial capital
for proper nouns)
Non-standardization and inconsistency
in Arabic written text (typos, and
spelling variants)
Ambiguity
Machine Translation
Direct
Transfer
Interlingua
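MT Approaches
[Figure: MT pyramid. Analysis rises on the source side (source word, source syntax); generation descends on the target side (target syntax, target word). Direct translation links the word level, transfer links the syntax level, and the interlingua sits at the apex.]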
English-to-Arabic Transfer based Approach
Source sentence (English)
-> Sentence Analysis (English Dic.; Morphological & syntactic Analysis Rules of English)
-> English Parse Tree
-> Transfer (Bi-ling Dic.; English-to-Arabic Transformation Rules)
-> Arabic Parse Tree
-> Sentence Synthesis (Arabic Dic.; Morphological Gen. & Synthesis Rules of Arabic)
-> Target sentence (Arabic)
Transfer approach
Involves analysis, transfer, and
generation components
If you have an Arabic parser & Arabic
syntactic generator, all you need is to
acquire the transfer rules and build the
transfer component
Simple Transfer
(1) [wi:$i, wi+1:$i+1, …, wk:$k] (1 ≤ i ≤ k) -> [wk:$k, wk-1:$k-1, …, wi:$i] (1 ≤ i ≤ k)
Networks performance evaluation -> تقييم أداء شبكة
[Figure: transfer of the English parse tree (nested np; nouns: networks (pl), performance (sg), evaluation (sg)) into the Arabic parse tree (nested np; nouns: تقييم (sg), أداء (sg), شبكة (pl)).]
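The simple transfer rule for this noun phrase amounts to reversing the noun sequence and looking each noun up in the bilingual dictionary. The sketch below assumes a three-entry dictionary and a flat list instead of a parse tree; number features are carried along so that a morphological generator could inflect the Arabic lemmas afterwards.

```python
# Hypothetical bilingual dictionary: English noun -> (Arabic lemma, number).
BILINGUAL = {
    "networks": ("شبكة", "pl"),
    "performance": ("أداء", "sg"),
    "evaluation": ("تقييم", "sg"),
}


def transfer_np(english_nouns):
    """Reverse the noun order and map each noun to its Arabic lemma."""
    return [BILINGUAL[noun] for noun in reversed(english_nouns)]


arabic_np = transfer_np(["networks", "performance", "evaluation"])
lemmas = " ".join(lemma for lemma, _ in arabic_np)
```

The output keeps "شبكة" as a lemma tagged plural, matching the feature shown on the Arabic tree; producing the inflected surface form is left to the generation step.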
Issues in the Transfer-based
MT approach
Synonyms of a word
Agreement
Acquisition: "اكتساب" or "استخلاص".
intelligent tutoring systems: "نظم التعليم الذكية" or "نظم التعليم الذكي"
Problems with prepositions
did you do fungal analysis?
"هل قمت بـتحليل الفطر؟"
…
Interlingua MT – Multilingual
translation
Interlingua = Semantic Representation
Deep analysis
(no need for transfer component)
Only analysis and generation components
Add Arabic analyzer to translate to
other languages
Add Arabic generator to translate from
other languages
Analysis of Arabic to Interlingua
العميل: أنا أرغب في حجز غرفة في الفندق
Preprocessor
Sentence
Analyzer
Arabic
Lexicon
Morphological
Analyzer
Arabic Grammar
Rules
Arabic
Morphology Rules
Parse Tree
Map
Lexicon
Mapper
Ontology
Interlingua(IF)
c:introduce-topic+reservation+disposition+room (room-spec=(room,
specifier=hote,identifiability=yes),disposition=(desire,who=i))
Generating Arabic from Interlingua
Interlingua(IF)
c:introduce-topic+reservation+disposition+room (room-spec=(room,
specifier=hote,identifiability=yes),disposition=(desire,who=i))
Map
Lexicon
Mapper
Feature Structure
Ontology
Map Rules
Sentence
Generator
Arabic
Lexicon
Arabic Grammar
Rules
Morphological
Generator
العميل: أنا أرغب في حجز غرفة في الفندق
Arabic Morphology
Rules
Issues in the interlingua
approach
Interlingua:
language-neutral representation
captures the intended meaning of the
source sentence
Requires a fully-disambiguating parser
Transferring Egyptian Colloquial
Dialect into Modern Standard Arabic
Be able to reuse MSA processing tools
with colloquial Arabic by transferring
colloquial Arabic words into their
corresponding MSA words.
Facilitate the communication with
colloquial Arabic speakers
Restore the Arabic dialect to the
standard language in use nowadays.
A one-to-one transfer example
Mapping: امتي؟ -> متي؟ (when?)
A one-to-many transfer example
Mapping: عال (on-the) -> علي (on) + ال (the)
A complete sentence example
جيت امتي؟ (You-came when?)
Mapping: جئت متي؟
Reordering: متي جئت؟ (When did-you-come?)
•Step (1)
• جيت -> جئت
• امتي -> متي
•Step (2)
• the new segment position for the word "امتى" is
start of sentence (SoS)
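The two steps of the sentence example (word mapping, then reordering) can be sketched with lookup tables. The tables hold only the words shown on the slides, punctuation is assumed to be stripped beforehand, and the names are hypothetical.

```python
# Hypothetical transfer tables built from the slide examples.
ONE_TO_ONE = {"جيت": "جئت", "امتي": "متي"}
ONE_TO_MANY = {"عال": ["علي", "ال"]}
SENTENCE_INITIAL = {"امتي"}  # colloquial words whose MSA form moves to SoS


def to_msa(tokens):
    """Map colloquial tokens to MSA, then move flagged words to the front."""
    front, rest = [], []
    for token in tokens:
        # Step 1: one-to-many mapping if available, else one-to-one,
        # else keep the token unchanged.
        target = ONE_TO_MANY.get(token) or [ONE_TO_ONE.get(token, token)]
        # Step 2: decide the new segment position.
        (front if token in SENTENCE_INITIAL else rest).extend(target)
    return front + rest


msa = to_msa(["جيت", "امتي"])
split = to_msa(["عال"])
```

Keying the reordering on the colloquial source word follows the slide, where the new position is attached to the dialect word rather than to its MSA equivalent.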
Issues in the transfer to MSA
More investigations are needed
Arabic NLP Free Resources
Arabic Morphological Analyzers
Tim Buckwalter Morphological Analyzer
http://www.qamus.org/
http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2002L49
Xerox
http://www.cis.upenn.edu/~cis639/arabic/input/keyboard_input.html
Arabic Morphological Analyzers
Aramorph
http://www.nongnu.org/aramorph/english/index.html
Arabic spell checker
Aspell
http://aspell.net/
http://www.freshports.org/arabic/aspell
Arabic Morphological
Generation
Sarf
http://sourceforge.net/projects/sarf
Tokenization & POS tagging
ArabicSVMTools: The tools utilize the
Yamcha SVM tools to tokenize, POS tag
and Base Phrase Chunk Arabic text
http://www1.cs.columbia.edu/~mdiab/
http://www1.cs.columbia.edu/~mdiab/software/AMIRA-1.0.tar.gz
Tokenization & POS tagging
MADA: a full morphological tagger for
Modern Standard Arabic.
http://www1.cs.columbia.edu/~rambow/software-downloads/MADA_Distribution.html
POS tagging
Stanford Log-linear Part-Of-Speech Tagger
http://nlp.stanford.edu/software/tagger.shtml
http://nlp.stanford.edu/software/stanfordarabic-tagger-2008-09-28.tar.gz
Tokenization & POS tagging
Attia's Finite State Tools for Modern
Standard Arabic
http://www.attiaspace.com/getrec.asp?rec=htmFiles/fsttools
Arabic Parsers
Dan Bikel’s Parser
http://www.cis.upenn.edu/~dbikel/
http://www.cis.upenn.edu/~dbikel/software.html
Attia Arabic Parser
http://www.attiaspace.com/
http://decentius.aksis.uib.no/logon/xle.xml
Arabic wordnet
Arabic WordNet
http://www.globalwordnet.org/AWN/
http://personalpages.manchester.ac.uk/staff/paul.thompson/AWNBrowser.zip
Translation resources
Tools: GIZA++, MOSES, Pharaoh,
Rewrite and BLEU
http://www.statmt.org/
APIs:
http://code.google.com/apis/ajax/playground/#translate
http://code.google.com/apis/ajax/playground/#batch_translate
Transliterate
http://code.google.com/apis/ajax/playground/#transliterate_arabic
Mailing Lists – just to be
connected to the NLP community
[email protected]
[email protected]
http://www.linguistlist.org/
[email protected]
http://mailman.uib.no/listinfo/corpora
http://www.semitic.tk/
[email protected]
http://www.arabicscript.org/CAASL3/index.html
Conclusion (1)
Arabic requires the treatment of the
language constituents at all levels:
morphology, syntax, and semantics.
Most research in Arabic NLP
concentrates on the analysis
part, aiming at automated
understanding of the Arabic language.
Conclusion (2)
Arabic NLP in general is significantly
underdeveloped.
In order to bridge this gap and help
Arabic NLP research catch up with
the many recent advances for Latin
languages, we need collaborative
efforts from the Arabic research
community.
Conclusion (3)
We need public-domain resources (in electronic
form):
Linguistic resources such as large Arabic
(bilingual) Corpora and treebanks.
Machine readable (bilingual) dictionaries
Morphological Analyzers
Parsers
…
Conclusion (4)
We need to secure funds for:
Exchanging visits (experience Expert
Network)
Buying software
Securing dedicated RAs and/or PhD students
for the NLP tasks.
Thank you!
Merci!
Shukran!
شكرا