Arabic Language Computing applied to
the Quran
- a PhD research project by
Kais Dukes
I-AIBS Institute for Artificial Intelligence
and Biological Systems
School of Computing
University of Leeds
The Challenge: An interdisciplinary
approach to understanding the Quran
(1) Quranic
Studies
(2) Traditional
Arabic
Linguistics
(3)
Computational
Linguistics
(1) What is the Quran?
The last in a series of 5 religious texts:
Holy Book                 Prophet            Text Dated
Suhuf Ibrahim (Scrolls)   Abraham            ?
The Tawrat (Torah)        Moses              1500 BCE?
The Zabur (Psalms)        David              1000 BCE?
The Injil (Gospel)        Jesus              1 CE
The Quran                 Muhammad (PBUH)    610-632 CE
(1) What is the Quran?
The central religious text of Islam
- Classical Arabic, 1300+ years ago
- All believers should learn the text;
translations are “interpretations”
- Islamic Law (legal logic)
- Divine guidance & direction
- Science and philosophy
- Has inspired Algebra, Linguistics
(2) Traditional Arabic Linguistics
Originated with Arab scholars studying the language of
the Quran (scientific analysis for at least 1000
years, a lot older than the English language!):
- Orthography (diacritics and vowelization)
- Etymology (Semitic roots)
- Morphology (derivation and inflection)
- Syntax (origins of dependency grammar)
- Discourse Analysis & Rhetoric
- Semantics & Pragmatics
(3) Computational Linguistics
The Quran is online, for keyword search,
BUT verse-by-verse translations are interpretations.
Muslims should access the “true” Classical Arabic source.
(3) Computational Linguistics
- How far can we go?
- Is an Artificial Intelligence system realistic?
Example question-answering dialog system:
Question: How long should I breastfeed my child for?
Answer: Mothers should suckle their offspring
for two years, if the father wishes to complete
the term (The Holy Quran, Verse 2:233).
An AI approach to understanding
the Quran
Central Hypothesis
Augmenting the text of the Quran with rich
annotation will lead to a more accurate AI
system.
- Prepare the data by annotating the Quran.
- Use the data to build an AI system for
concept search and question-answering.
Annotating the Quran
Challenges
Orthography - Complex non-standard script
Morphology (word structure) - Arabic is highly
inflected, challenging to analyze
Grammar - Phrase structure, dependency
Semantics – Ontology of Entities and Concepts
referred to by pronouns and nouns
Annotating the Quran
Solutions
- Computing advances have made annotation
possible, to high accuracy
- Leverage existing resources from Traditional
Arabic Grammar
- Machine-learning annotation followed by
manual verification
- Community effort using online volunteers
Recent Advances: Orthography
An accurate digital copy of the Quran?
Encoding Issues
- Missing diacritics
- Simplified script (not
Uthmani)
- Windows code page
1256, not Unicode
Google Search for verse (68:38) on Jan 21, 2008
shows many typos
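The encoding issues listed above can be made concrete with a minimal Python sketch. It relies only on the standard library: `cp1256` is Python's built-in codec for Windows code page 1256. The helper names `fits_cp1256` and `strip_diacritics` are hypothetical, introduced only for illustration, and the sample strings are this sketch's own, not taken from the slides.

```python
import unicodedata

def fits_cp1256(text: str) -> bool:
    """Return True if every character can be stored in Windows code page 1256."""
    try:
        text.encode("cp1256")
        return True
    except UnicodeEncodeError:
        return False

def strip_diacritics(text: str) -> str:
    """Drop combining marks (vowel signs etc.), imitating a copy with missing diacritics."""
    return "".join(ch for ch in text if unicodedata.combining(ch) == 0)

# Sample in simplified script: beh+kasra, seen+sukun, meem+kasra.
simple = "\u0628\u0650\u0633\u0652\u0645\u0650"
# Alef wasla (U+0671), a letter that appears in Uthmani-script text.
alef_wasla = "\u0671"

print(fits_cp1256(simple))       # basic letters and vowel signs fit the code page
print(fits_cp1256(alef_wasla))   # this letter has no slot in code page 1256
print(strip_diacritics(simple))  # bare consonants only
```

A keyword search that compares stripped forms on both sides would tolerate missing diacritics, at the cost of losing the distinctions the diacritics carry.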
Recent Advances: Orthography
Tanzil Project (http://tanzil.info)
- Stable version released May 2008
- Uses Unicode XML encoding, including
the special characters designed for the
complex Arabic script of the Quran
- Manually verified to 100% accuracy by a
group of experts who have memorized the
entire text of the Quran
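As a rough illustration of what reading such a Unicode XML file could look like, the sketch below parses a small inline sample with Python's standard library. The element and attribute names (`quran`, `sura`, `aya`, `index`, `text`) are assumptions about the file layout, the verse text is a placeholder, and the helper `get_verse` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Placeholder sample; the tag and attribute names are assumed, not confirmed.
SAMPLE = """<quran>
  <sura index="1" name="sample-chapter">
    <aya index="1" text="placeholder text of verse 1"/>
    <aya index="2" text="placeholder text of verse 2"/>
  </sura>
</quran>"""

def get_verse(root, chapter, verse):
    """Return the text of (chapter:verse), or None if it is not present."""
    for sura in root.iter("sura"):
        if int(sura.get("index")) == chapter:
            for aya in sura.iter("aya"):
                if int(aya.get("index")) == verse:
                    return aya.get("text")
    return None

root = ET.fromstring(SAMPLE)
print(get_verse(root, 1, 2))
```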
Recent Advances: Orthography
Java Quran API (http://jqurantree.org)
(Dukes 2009)
- Java classes for querying the Tanzil XML of the Quran
- Gives authentic script on web pages
Recent Advances: Morphology
- Buckwalter Arabic Morphological Analyzer
(Tim Buckwalter, 2002)
- Morphological Analysis of the Quran at the
University of Haifa (Shuly Wintner, 2004)
- Lexeme & feature based morphological
representation of Arabic (Nizar Habash, 2006)
The Haifa Corpus (2004)
- Multiple analyses for each word (up to 5), e.g.:
rbb+fa&l+Noun+Triptotic+Masc+Sg+Pron+Dependent+1P+Sg
rbb+fa&l+Noun+Triptotic+Masc+Sg+Gen
- Not manually verified; the authors report an F-measure of 86%
- Non-standard annotation scheme, not familiar to traditional Arabic linguists
(e.g. extracting a list of all verbs is non-trivial)
- Arabic text is only encoded phonetically instead of using the original Arabic
(e.g. searching for a specific root is not easy)
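To show why querying this format takes some work, here is a minimal Python sketch that splits the two analysis strings shown above into fields. The field layout (root, then pattern, then a flat list of features) is inferred from those two examples only, and the `Verb` feature name and the helpers `parse_analysis` and `may_be` are hypothetical.

```python
def parse_analysis(line: str) -> dict:
    """Split one '+'-separated analysis into root, pattern and features."""
    root, pattern, *features = line.split("+")
    return {"root": root, "pattern": pattern, "features": features}

# The two alternative analyses quoted on the slide for a single word.
candidates = [
    "rbb+fa&l+Noun+Triptotic+Masc+Sg+Pron+Dependent+1P+Sg",
    "rbb+fa&l+Noun+Triptotic+Masc+Sg+Gen",
]
parsed = [parse_analysis(c) for c in candidates]

def may_be(parsed_candidates, tag: str) -> bool:
    """True if at least one of the unverified candidate analyses carries the tag."""
    return any(tag in p["features"] for p in parsed_candidates)

print(parsed[0]["root"], parsed[0]["pattern"])
print(may_be(parsed, "Noun"), may_be(parsed, "Verb"))
```

Because each word keeps several unverified candidates, a query such as "all verbs" can only return words that may be verbs, which is one reason a manually verified single analysis per word is easier to search.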
The Quranic Arabic Corpus
http://corpus.quran.com/
Kais Dukes: Arabic Language Computing Applied
to the Quran, PhD (part-time)
- Word structure: colour-coded morphological analysis
- Translation: word-for-word English translations
- Grammar: dependency parse following Arabic tradition
- Semantics: ontology of entities and concepts
- Machine learning: annotations used for A.I. training
- Impact: dozens of researchers have collaborated with or cited the project,
and a million visitors have used the website this year
The Quranic Arabic Corpus
Verified Uthmani Script
- Unicode Uthmani Script
- Sourced from the verified Tanzil project
The Quranic Arabic Corpus
Phonetics (faja'alnāhumu)
- Phonetic transcription generated algorithmically
- Guided by Arabic vowelized diacritics
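A minimal sketch of diacritic-guided transcription, in Python, might look like the following. It is not the corpus's actual algorithm: the character table covers only the handful of letters needed for the example, the Arabic spelling of the sample word is this sketch's assumption, and the function name `transliterate` is hypothetical.

```python
# Tiny, illustrative mapping; a real table would cover the whole alphabet.
CONSONANTS = {
    "\u0641": "f", "\u062C": "j", "\u0639": "'", "\u0644": "l",
    "\u0646": "n", "\u0647": "h", "\u0645": "m",
}
SHORT_VOWELS = {"\u064E": "a", "\u064F": "u", "\u0650": "i"}  # fatha, damma, kasra
SUKUN = "\u0652"                      # marks a consonant with no following vowel
LONG_A_MARKERS = {"\u0627", "\u0670"}  # alef and superscript alef lengthen a fatha

def transliterate(text: str) -> str:
    """Walk the vowelized text and let each diacritic decide the vowel written out."""
    out = []
    for ch in text:
        if ch in CONSONANTS:
            out.append(CONSONANTS[ch])
        elif ch in SHORT_VOWELS:
            out.append(SHORT_VOWELS[ch])
        elif ch == SUKUN:
            continue
        elif ch in LONG_A_MARKERS and out and out[-1] == "a":
            out[-1] = "ā"
        else:
            raise ValueError(f"unsupported character: U+{ord(ch):04X}")
    return "".join(out)

# Assumed simplified-script spelling of the slide's example word.
word = ("\u0641\u064E\u062C\u064E\u0639\u064E\u0644\u0652"
        "\u0646\u064E\u0627\u0647\u064F\u0645\u064F")
print(transliterate(word))  # faja'alnāhumu
```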
The Quranic Arabic Corpus
Interlinear translation
- Word-for-word translation from accepted sources
- Interlinear translation scheme
The Quranic Arabic Corpus
Location Reference (21:70:4)
- Common standard for verses (Chapter:Verse)
- Extended in the QAC corpus to include word numbers
and segment numbers, e.g. (21:70:4:2)
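The reference scheme above can be sketched as a small Python parser. The class name `Location` and the function `parse_location` are hypothetical; only the colon-separated layout (chapter, verse, optional word, optional segment) comes from the slide.

```python
import re
from typing import NamedTuple, Optional

class Location(NamedTuple):
    chapter: int
    verse: int
    word: Optional[int] = None     # present in word-level references
    segment: Optional[int] = None  # present in segment-level references

# Two to four colon-separated numbers, with optional surrounding parentheses.
_PATTERN = re.compile(r"^\(?(\d+):(\d+)(?::(\d+))?(?::(\d+))?\)?$")

def parse_location(ref: str) -> Location:
    """Parse '(chapter:verse[:word[:segment]])' into a Location."""
    match = _PATTERN.match(ref.strip())
    if match is None:
        raise ValueError(f"not a location reference: {ref!r}")
    return Location(*(int(g) if g is not None else None for g in match.groups()))

print(parse_location("(21:70:4)"))
print(parse_location("(21:70:4:2)"))
```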
The Quranic Arabic Corpus
Morphological Segmentation
- Division of a single word into multiple segments
- Part-of-speech tag assigned to each segment
- Traditional Arabic Grammar rules used for division
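One possible way to hold a segmented word in memory is sketched below in Python. The class names `Segment` and `Word` are hypothetical, and the particular split and tags shown for the example word are illustrative assumptions, not values copied from the corpus.

```python
from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class Segment:
    form: str  # transliterated form of this segment
    tag: str   # part-of-speech tag assigned to this segment

@dataclass
class Word:
    location: str            # e.g. "(21:70:4)"
    segments: List[Segment]  # in reading order

    def surface(self) -> str:
        """Rejoin the segments into the full word."""
        return "".join(s.form for s in self.segments)

    def tags(self) -> List[str]:
        """One tag per segment, in order."""
        return [s.tag for s in self.segments]

# Illustrative split: prefix + verb + attached pronoun (assumed, not verified).
word = Word("(21:70:4)", [
    Segment("fa", "CONJ"),
    Segment("ja'alnā", "V"),
    Segment("humu", "PRON"),
])
print(word.surface(), word.tags())
```

Keeping the tag on the segment rather than on the word is what lets a single written word carry several parts of speech at once.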
The Quranic Arabic Corpus
Morphological segment features
The Quranic Arabic Corpus
Arabic Grammar Summary
The Quranic Arabic Treebank
Syntactic Annotation
- Dependency Grammar based on إعراب (i'rāb)
- Syntactico-semantic roles for each word
The Quranic Arabic Treebank
Ontology of entities and concepts
- linked to/from nouns and pronouns in the text
The Quranic Arabic Treebank
Framework for collaboration
Message Board:
“If you come across a word and you feel that a better
analysis could be provided, you can suggest a
correction online by clicking on an Arabic word”
(currently 5228 resolved messages; 1048 under review)
Resources:
Publications; Citations, Reviews, FAQs, Feedback,
Data Download, Software download, Mailing list
The Quranic Arabic Treebank
Users: researchers, public
- Artificial Intelligence and Computational Linguistics
- Arabic linguistics
- Quranic and Islamic Studies
- Classical literature analysis
- Anyone who wants to appreciate the Quran
The Quranic Arabic Treebank
New Computational Linguistics?
- First Treebank of Classical Arabic
- Free Treebank of the Quran
- First formal representation of
Traditional Arabic Grammar using
constituency/dependency graphs
- Machine-Learning parser
The Quranic Arabic Corpus
Part-of-speech Tagging
Tag     Name                      Arabic Name
N       Noun                      اسم
PN      Proper noun               اسماء علم
PRON    Personal pronoun          ضمير
DEM     Demonstrative pronoun     اسم اشارة
REL     Relative pronoun          اسم موصول
ADJ     Adjective                 صفة
V       Verb                      فعل
P       Preposition               حرف جر
PART    Particle                  حرف
INTG    Interrogative particle    حرف استفهام
VOC     Vocative particle         حرف نداء
NEG     Negative particle         حرف نفي
FUT     Future particle           حرف استقبال
CONJ    Conjunction               حرف عطف
NUM     Number                    رقم
T       Time adverb               ظرف زمان
LOC     Location adverb           ظرف مكان
EMPH    Emphatic lām prefix       لام التوكيد
PRP     Purpose lām prefix        لام التعليل
IMPV    Imperative lām prefix     لام الامر
INL     Quranic initials          حروف مقطعة
- Part-of-speech tags are adapted from Traditional Arabic Grammar and
mapped to English equivalents (not the other way around)
- These tags apply to words in the Quran, as well as to individual
morphological segments in the text
Automatic Annotation
Classical Arabic Dependency Parser
- Joakim Nivre (2009): dependency parsing using a shift/reduce
queue/stack architecture with machine learning
- Following a similar architecture, but with hand-written rules,
the custom parser has an F-measure of 77.2%
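The shift/reduce idea with hand-written rules can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the project's parser and not Nivre's system: the rule table, the role labels (`neg`, `subj`, `adj`, `gen`) and the function name `parse` are all hypothetical, and only the part-of-speech tags come from the tagset in this talk.

```python
from typing import Dict, List, Tuple

Arc = Tuple[int, int, str]  # (head index, dependent index, role label)

# Hypothetical hand-written rules keyed by (tag below top of stack, tag on top).
# "LEFT": the top item heads the one below it; "RIGHT": the one below heads the top.
RULES: Dict[Tuple[str, str], Tuple[str, str]] = {
    ("NEG", "V"): ("LEFT", "neg"),
    ("V", "N"): ("RIGHT", "subj"),
    ("N", "ADJ"): ("RIGHT", "adj"),
    ("P", "N"): ("RIGHT", "gen"),
}

def parse(tags: List[str]) -> List[Arc]:
    """Greedy shift/reduce over a queue of word indices and a stack."""
    stack: List[int] = []
    queue = list(range(len(tags)))
    arcs: List[Arc] = []
    while queue or len(stack) > 1:
        rule = None
        if len(stack) >= 2:
            below, top = stack[-2], stack[-1]
            rule = RULES.get((tags[below], tags[top]))
        # Delay a RIGHT reduction if the next queued word would attach to the top item.
        nxt = RULES.get((tags[stack[-1]], tags[queue[0]])) if stack and queue else None
        wait = nxt is not None and nxt[0] == "RIGHT"
        if rule is not None and rule[0] == "LEFT":
            arcs.append((top, below, rule[1]))
            del stack[-2]
        elif rule is not None and not wait:
            arcs.append((below, top, rule[1]))
            stack.pop()
        elif queue:
            stack.append(queue.pop(0))  # shift
        else:
            break  # no rule applies and nothing is left to shift
    return arcs

print(parse(["NEG", "V", "N"]))
print(parse(["V", "N", "ADJ"]))
```

A machine-learning parser of the kind cited above would replace the `RULES` lookup with a trained classifier that picks the next action; the queue and stack mechanics stay the same.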
University of Leeds Postgraduate
Researcher Conference 2011
Criteria for “PGR Researcher of the Year 2011”
• Ability to communicate research to the lay and
non-specialist research audience
• Impact/potential impact of the research in
terms of e.g. application of findings for
economic or social benefit; the significance of
the contribution/potential contribution of the
research to the academic subject area
• Evidence of local or national publicity or public
engagement.
Ability to communicate research to
the lay and non-specialist audience
Example Feedback (319 comments)
“I would like to applaud you for your effort”
Prof Behnam Sadeghi, Stanford University
“We are big admirers of the work” Prof Gregory
Crane, Classics Dept, Tufts University
“I regularly use your work on the Qur'an and
read it whenever I can.” Prof Yousuf Islam,
Director, Daffodil International University
“Congratulations to all concerned on this
project” - Prof Michael Arthur, VC, Leeds Uni
Impact: application of findings for
economic or social benefit
Over a million users already, and growing; many
unforeseen social benefits, e.g.:
“I work as a chaplain in correctional centers in
the State of Missouri, U.S.A. Thanks for your
permission to use the Quranic Arabic Corpus
in these correctional centers” Tadar Wazir.
Impact: significance of the research
to the academic subject area
• 10 papers in research conferences & journals
• 25 citations (from Google Scholar) - so far...
• Positive feedback from top researchers
• Only free-to-download Arabic treebank
• A de-facto standard data-set for AI research
Evidence of local or national
publicity or public engagement
Newspapers, e.g. Muslim Post; better still:
Website – world-wide public engagement!
Conclusion
This is not the end.
Still to come: 2nd half of PhD project;
and more?
Kais Dukes
I-AIBS Institute for Artificial Intelligence
and Biological Systems
School of Computing
University of Leeds