Transcript slides
FIRE 2013
Presentation on: Transliterated Search using Syllabification Approach
By: Hardik Joshi (1), Apurva Bhatt (1), Honey Patel (2)
{hardikjjoshi,apurva.bhatt7,Honeypatel.39}@gmail.com
(1) Department of Computer Science, Gujarat University, Ahmedabad, India.
(2) L.J. College of Engineering, Ahmedabad, India
@FIRE, 4th Dec 2013
Content
Introduction
Our Approach
Syllabification
Our Results
Error And Analysis
Conclusion
Introduction
There is a need to provide local-language support in web-based
applications, because many domains, such as e-commerce sites,
currently require knowledge of English.
The challenge in transliteration: take the word “राष्ट्रपति”. For
this word, “rashtrapati”, “rashtrapathi”, “raashtrapathy” and
“raashtrpati” are all possible Roman spellings, and deciding
which one should be treated as correct is itself an issue.
Transliteration becomes difficult in the presence of
out-of-vocabulary (OOV) words and noisy words.
In both subtasks, the transliteration was performed using a
syllabification approach.
In subtask-1, we performed morphological analysis of English
words; a corpus-based approach was then used to identify
frequently occurring Hindi words.
In subtask-2, queries were formulated for two separate run
submissions: one containing both Roman and Devanagari script,
and one containing Roman script only.
Syllabification Approach
Linguists have observed that different languages have constraints on
possible consonant and vowel sequences that characterize not only the
word structure for the language but also the syllable structure.
Syllable structure (diagram): syllable = onset + rhyme; rhyme = nucleus + coda
Vowel at the center (nucleus)
Consonant(s) at the beginning (onset)
The end is the coda
Syllable Structure Example
Word: Sprint
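The onset / nucleus / coda split described above can be sketched in a few lines of Python. This is only a naive, spelling-based illustration, not the syllabifier used in the submitted system: the function names `split_syllable` and `syllabify` are hypothetical, and only the letters a, e, i, o, u are treated as vowels (so 'y' used as a vowel is not handled).

```python
import re

VOWELS = set("aeiou")  # assumption: purely orthographic vowel set


def split_syllable(syllable):
    """Split ONE syllable into (onset, nucleus, coda).

    onset   = consonants before the first vowel
    nucleus = the first run of vowels
    coda    = everything after the nucleus
    """
    s = syllable.lower()
    i = 0
    while i < len(s) and s[i] not in VOWELS:
        i += 1
    j = i
    while j < len(s) and s[j] in VOWELS:
        j += 1
    return s[:i], s[i:j], s[j:]


def syllabify(word):
    """Naively cut a Roman-script word into consonant+vowel chunks.

    Consonants left over at the end of the word are attached to the
    last chunk as its coda. A word with no vowel is returned whole.
    """
    w = word.lower()
    chunks = re.findall(r"[^aeiou]*[aeiou]+(?:[^aeiou]*$)?", w)
    return chunks if chunks else [w]


if __name__ == "__main__":
    print(split_syllable("sprint"))  # ('spr', 'i', 'nt')
    print(syllabify("narayan"))      # ['na', 'ra', 'yan']
```

A real syllabifier would also need language-specific rules for consonant clusters such as "chh" or "ksh"; this sketch deliberately ignores them.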
Training Format
Source      Target
sudakar     स ◌ु द ◌ा क र
chhagan     छगण
jitesh      ज ◌ि त ◌े श
narayan     न ◌ा र ◌ा य ण
shiv        श ◌ि व
madhav      म ◌ा ध व
mohammad    म ◌ु ह म ◌् म द
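As a rough illustration of the training format above, the target side can be produced by splitting each Devanagari word into space-separated Unicode code points, so that consonants and vowel signs become separate tokens. The helper names `to_units` and `make_training_pair` are hypothetical, and whether the source side was also split (into characters or syllables) is not shown on the slide, so that is left as an optional flag here.

```python
def to_units(word):
    """Return the word as space-separated Unicode code points."""
    return " ".join(word)


def make_training_pair(source, target, split_source=False):
    """Build one (source, target) training line pair.

    The target is always split into code points; the source is split
    only when split_source is True (an assumption, see the note above).
    """
    src = to_units(source) if split_source else source
    return src, to_units(target)


if __name__ == "__main__":
    names = [("shiv", "शिव"), ("madhav", "माधव")]
    for roman, devanagari in names:
        src, tgt = make_training_pair(roman, devanagari)
        print(src, "\t", tgt)
```

Splitting by code point keeps the sketch simple; a syllable-level split would instead group each consonant with its vowel sign.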
Algorithm for subtask-I
Step 1: First, each word is looked up in an English dictionary.
Step 2: Perform spell-checking, stemming and morphological
analysis for English; if there is no spelling error and a match is
found, label the word as English (\E).
Step 3: If the word is not found as an English word, check it
against an English corpus of US newspapers.
Step 4: If the word is found there, check it against an English
corpus of Indian newspapers.
Step 5: If the word is found in the US newspaper corpus and not
found in the Indian newspaper corpus, label it \E.
Step 6: Steps 2 and 5 are applied in parallel to identify English
words, which are labeled \E.
Step 7: The remaining words are to be transliterated into
Hindi and are labeled \H.
Step 8: Apply the Moses tool, which helps transliterate these
Roman-script words into Hindi words.
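The labeling decision in the steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the submitted system: the dictionary and the two newspaper corpora are represented as plain Python sets, the spell-check, stemming and morphological analysis of Step 2 are reduced to a lowercase lookup, and the names `label_word` and `label_query` are hypothetical. The Moses transliteration of Step 8 is not shown.

```python
def label_word(word, english_dict, us_news, indian_news):
    """Return 'E' for English, 'H' for a word to be transliterated.

    english_dict, us_news and indian_news are sets of lowercase words
    (toy stand-ins for the real dictionary and corpora).
    """
    w = word.lower()
    # Steps 1-2: dictionary match -> English.
    if w in english_dict:
        return "E"
    # Steps 3-5: seen in the US corpus but not in the Indian corpus -> English.
    if w in us_news and w not in indian_news:
        return "E"
    # Step 7: everything else is treated as Hindi.
    return "H"


def label_query(query, english_dict, us_news, indian_news):
    """Label every whitespace-separated token as token\\E or token\\H."""
    return [
        tok + "\\" + label_word(tok, english_dict, us_news, indian_news)
        for tok in query.split()
    ]


if __name__ == "__main__":
    english_dict = {"song", "lyrics"}
    us_news = {"download", "main"}
    indian_news = {"main", "rani"}
    print(label_query("mere sapnon ki rani song download",
                      english_dict, us_news, indian_news))
```

One plausible reading of Step 5 is that English-language Indian newspapers also contain Romanized Hindi words, so a word seen only in the US corpus is more likely to be genuinely English; the slides do not state this rationale explicitly.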
RESULT OF SUBTASK-1
Results For Subtask 2
Run 1 (Devanagari and Roman script): “मर सापन न कक रानी काब आयगी ि mere sapnon ki rani kab aayegi tu”.
Run 2 (Roman script only): “mere sapnon ki rani kab aayegi tu”.
Metrics    Run-1    Run-2    Maximum Score    Median Score
nDCG@5     0.5627   0.5262   0.8052           0.5620
nDCG@10    0.5619   0.5232   0.8002           0.5608
MAP        0.2546   0.2163   0.4236           0.2355
MRR        0.5835   0.5730   0.8440           0.5884
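For readers unfamiliar with the metrics in the table, the sketch below shows common textbook definitions computed for a single query; MAP and MRR are the means of average precision and reciprocal rank over all queries. It is not the official FIRE evaluation script: the function names are hypothetical, the nDCG variant (gain divided by log2 of rank + 1) is one of several in use, and the ideal ranking is computed from the retrieved list only, which is a simplification.

```python
import math


def reciprocal_rank(rels):
    """1 / rank of the first relevant result, or 0.0 if there is none."""
    for rank, rel in enumerate(rels, start=1):
        if rel > 0:
            return 1.0 / rank
    return 0.0


def average_precision(rels):
    """Mean of precision@rank over the ranks holding a relevant result."""
    hits, total = 0, 0.0
    for rank, rel in enumerate(rels, start=1):
        if rel > 0:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def dcg_at_k(rels, k):
    """Discounted cumulative gain over the top k results."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(rels[:k], start=1))


def ndcg_at_k(rels, k):
    """DCG@k divided by the DCG@k of the same results sorted ideally."""
    ideal = dcg_at_k(sorted(rels, reverse=True), k)
    return dcg_at_k(rels, k) / ideal if ideal > 0 else 0.0


if __name__ == "__main__":
    graded = [0, 2, 1, 0, 0]  # toy relevance grades in ranked order
    print(reciprocal_rank(graded))
    print(average_precision(graded))
    print(ndcg_at_k(graded, 5))
```

Each function takes the relevance grades of the retrieved documents in ranked order, with 0 meaning not relevant.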
Error And Analysis
Several problems in the transliteration decreased the
precision.
Errors in the matra (vowel sign): “sapnon” => “सापन न”, “ki” => “की”,
“kab” => “काब”, “main” => “ममन”, “mein” => “मीन”,
“na” => “न”, “ka” => “क”
Multiple mappings of a letter, e.g. t = त, ट: tera => टरा,
tum => तूम, to => ट, teri => टरर.
Missing sounds (फ, ख, छ ‘chh’, ksh), e.g. for the word “accha” we
got “आक्का”, and for “poochho” we got “पछूट”.
Multiple transliterations: c, k
The vowels are not always transliterated correctly,
e.g. “lo” => “लॉ”, “ho” => “ह र”, “ko” => “कॉ”
Spelling variations (shree, shri)
Conjunct formation (“kya” => “कया”)
Missing vowels: ‘ak tr khan’ (अकुतर खान)
‘y’ as a vowel: ‘anthony’ and ‘Shyam’
Conclusion
We used the syllabification approach and considered the most
probable term in the transliteration process. The word labeling
task was performed assuming that a term belongs either to the
English language or to the Hindi language. We obtained high
English recall, as the labeling approach used morphological
analysis and a dictionary lookup. However, owing to the
syllabification model, the transliteration did not achieve high
precision, which lowered the scores on the transliteration task
and, subsequently, the precision metrics on the song lyrics
retrieval task.