Investigating the Ancient Meroitic Language
Using Statistical Natural Language Techniques:
Zipf’s Law and Word Co-Occurrences
Reginald Smith
August 10, 2006
Sudan Studies Association Conference
Rhode Island College
Meroitic is the language of the
ancient kingdom of Kush
• Used for almost six hundred years from
2nd century BCE to 4th century CE
• Phonetic language written right to left (like
Arabic)
• Transliteration made possible by work of
British archaeologist FL Griffith around
1910
Meroitic remains largely
undeciphered and an enigma
• No complete vocabulary is available
• Some words such as place names, loan words,
or simple concepts are known
– For example, perhaps “qore” means king or “qes” is Kush
• Many attempts have been made to understand
Meroitic using phonology or comparative
linguistics
– Scholars have tried in vain to find a known language
that is a relative (see sources in paper)
– We wish we had a bilingual text like the Rosetta stone
to guide us
A new method could use
mathematics and linguistics
• Statistical natural language processing
analyzes the properties of language using
a mix of statistics and linguistics
• There are several properties of languages
that are the same in all human languages
• Certain techniques can also help us
possibly infer meanings of words (by
relating them to other known words)
Zipf’s Law: Frequencies of Words
• If you rank the words in a text by frequency (the number
of times each word appears, with rank #1 being the most
frequent) and then relate each word's rank to its
frequency, you get Zipf’s Law
• Zipf’s Law: where F is the frequency of a word,
C is a constant, R is the rank, and α is known as
the power law exponent
$F = C R^{-\alpha}$    (1)
• For all languages α ≈ 1
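A minimal sketch of the rank-frequency table behind Zipf's Law, assuming the text has already been split into words. The function name `rank_frequency` and the toy English sample are illustrative assumptions, not part of the original analysis.

```python
from collections import Counter

def rank_frequency(words):
    """Return (rank, word, frequency) tuples, with rank 1 the most frequent word."""
    counts = Counter(words)
    # most_common() orders words by descending count
    return [(rank, word, freq)
            for rank, (word, freq) in enumerate(counts.most_common(), start=1)]

# Toy English sample, purely illustrative (not Meroitic data)
sample = "the cat and the dog and the bird".split()
table = rank_frequency(sample)
for rank, word, freq in table:
    print(rank, word, freq)
```

Each row pairs a rank R with a frequency F, which are the two quantities related in Zipf's Law.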
Zipf Law Graphs
• When you graph the frequency vs. the rank on a log-log
graph (graphing the logarithm of frequency vs. the
logarithm of rank) you get a straight line whose slope is -α
Figure: Zipf line fit on data. The red line is the fitted slope on the data points
Picture Source: University of Helsinki CS department
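The line fit on a log-log graph can be sketched as an ordinary least-squares fit of log(frequency) against log(rank). This is only one plausible way to estimate the slope. The slides do not say which fitting procedure was used, and the function name `loglog_slope` and the synthetic frequencies are assumptions for illustration.

```python
import math

def loglog_slope(frequencies):
    """Least-squares slope of log(frequency) vs. log(rank).

    `frequencies` must be sorted in descending order so that the
    first entry has rank 1.
    """
    xs = [math.log(rank) for rank in range(1, len(frequencies) + 1)]
    ys = [math.log(f) for f in frequencies]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var

# Synthetic frequencies that follow F = C / R exactly (C = 1000); not real data
ideal = [1000 / rank for rank in range(1, 51)]
slope = loglog_slope(ideal)
print(round(slope, 3))
```

A fitted slope close to -1 corresponds to a power law exponent α close to 1.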
Does Meroitic follow Zipf’s Law?
• The two graphs below show log-log plots of frequency
vs. rank for the Meroitic words in 69 texts. The slopes
are shown for each
– The normal plot counts the words as is. The morpheme out plot
splits out suffixes like –lowi into the separate words “lo” and “wi”
– Since it has a slope of nearly -1 the morpheme out model of
Meroitic seems to follow Zipf’s Law
Normal plot: slope = -0.81
Morpheme out plot: slope = -1.03
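The "morpheme out" preprocessing can be sketched as a suffix-splitting pass run before counting words. Only the split of “lowi” into “lo” and “wi” comes from the slides. The suffix table layout, the function name `split_morphemes`, the decision to keep the remaining stem as its own word, and the placeholder token `stemlowi` are all assumptions.

```python
# Hypothetical suffix table: only "lowi" -> "lo" + "wi" is taken from the slides
SUFFIX_SPLITS = {"lowi": ["lo", "wi"]}

def split_morphemes(tokens, suffix_splits=SUFFIX_SPLITS):
    """Replace each token ending in a known suffix by its stem plus the suffix parts."""
    out = []
    for token in tokens:
        for suffix, parts in suffix_splits.items():
            if token.endswith(suffix):
                stem = token[: -len(suffix)]
                if stem:              # keep the stem as its own word (assumption)
                    out.append(stem)
                out.extend(parts)
                break
        else:
            out.append(token)         # no known suffix: keep the token unchanged
    return out

# "stemlowi" is a placeholder token, not a real Meroitic word
tokens = split_morphemes(["stemlowi", "qore"])
print(tokens)
```

The resulting token list would then be ranked and counted in the same way as the unsplit text.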
So what does this show us
(besides graphs)
• Despite the apparently small number of texts
available, our sample of Meroitic is structured
just like all other human languages (English,
Chinese, etc.)
• Therefore, even though we don’t know the
meaning of the words, we know that the
language we have is representative
– Even though most of our samples are redundant
funeral stelae
• We can then proceed to use other statistical
techniques on Meroitic and also compare its
statistical features to other languages
Step Two: Word Co-occurrence
• When words occur together in a text, they are
said to co-occur
– “I am here” has co-occurrence between “I-am” and
“am-here”
• Co-occurrences can tell us about the words if we
have enough of them
– Words that co-occur with the same words often have
similar parts of speech or even meanings
– Can we use word co-occurrence in Meroitic to
analyze classes of words?
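A minimal sketch of counting co-occurrences, assuming that "co-occur" means two words standing next to each other, as in the “I am here” example. The function name `cooccurrences` is an illustrative assumption.

```python
from collections import Counter

def cooccurrences(sentences):
    """Count adjacent word pairs (left word, right word) across sentences."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.split()
        # zip pairs each word with the word that directly follows it
        counts.update(zip(words, words[1:]))
    return counts

pairs = cooccurrences(["I am here"])
for (left, right), n in pairs.items():
    print(f"{left}-{right}: {n}")
```

With enough text, these counts show which words tend to appear in the same contexts.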
What I did with Meroitic
• I analyzed Meroitic by matching together words
that co-occurred with the same types of words
• For example if you have two sentences: “I eat
horses” and “We eat lizards”
– I match “I” and “We” because they both co-occur with
“eat”
– I also match “horses” and “lizards” because they also
co-occur with “eat” (in the opposite direction*)
• I then graph connected words together and
analyze them with software
– What happens?
*Technical note: I actually used undirected edges for co-occurring words in the graph shown on the next page
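The matching step can be sketched as pairing words that share a neighbour on the same side. This sketch keeps direction so that it reproduces the “I eat horses” example, whereas the technical note says the actual graph used undirected edges. The function name `match_by_context` and the data structures are assumptions, and the graph analysis software itself is not reproduced here.

```python
from collections import defaultdict
from itertools import combinations

def match_by_context(sentences):
    """Pair up words that share a following word or share a preceding word."""
    followers = defaultdict(set)    # word -> words seen directly after it
    preceders = defaultdict(set)    # word -> words seen directly before it
    for sentence in sentences:
        words = sentence.split()
        for left, right in zip(words, words[1:]):
            followers[left].add(right)
            preceders[right].add(left)
    vocab = sorted(set(followers) | set(preceders))
    matches = set()
    for a, b in combinations(vocab, 2):
        # a non-empty intersection means the two words share a context word
        if followers[a] & followers[b] or preceders[a] & preceders[b]:
            matches.add((a, b))
    return matches

matches = match_by_context(["I eat horses", "We eat lizards"])
print(sorted(matches))
```

Each matched pair would become an edge in the word graph, and clusters of connected words are the candidate word classes.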
Meroitic Words Graph
[Figure: graph of Meroitic words with four clusters labeled Group 1, Group 2, Group 3, and Group 4]
• Four main groups of words emerge, corresponding
well to Meroitic categories including positions
and titles, verbs, places, and miscellaneous nouns
Results
• Techniques like word co-occurrence matching
can help us categorize Meroitic words whose
classes we previously had to guess, by
mapping them against words whose part of
speech we already know
• Similar statistical techniques may allow us
to match words with a similar “meaning” to
infer the meanings of some words
– This is still speculative though
Conclusion
• Statistical natural language processing is a new
approach to Meroitic that could supplement
other current efforts in the language
• Much more work remains to be done, but this
new avenue may help us move closer to the
goal of understanding this beautiful and
mysterious language
• Acknowledgements: I give my boundless
appreciation to Dr. Richard Lobban and Dr.
Laurance Doyle for the help and advice they
gave me on this paper’s topics
