Tamil - cfilt - Resource Centre for Indian Language

29 April 2013
DRAVIDIAN WORDNET
S.Arulmozi
Dravidian University
Tamil Thesaurus
• Preliminary work on lexical semantics.
• Monumental work on Tamil Thesaurus.
• Ontological classification of Tamil Vocabulary
• Rajendran, S. (2001) tamizhc coRkaLanjciyam. (in Tamil). Tamil University Publication.
Domains in Tamil Thesaurus
• Tamil vocabulary is classified into four
major domains:
• Entities
• Abstracts
• Events and
• Relationals
Lexical Hierarchy of the Domain 'Construction'
parumaippeyarkaL 'concrete nouns'
  aHRinaippeyarkaL 'irrational nouns'
    uyirillaatavai 'non-living beings'
      uruvaakkiya maRRum patananjceyta poruTkaL 'manufactured and processed items'
        kaTTappaTTavai 'constructed'
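The hierarchy above can be read as a hypernym chain, with each term a more specific kind of the one before it. The following is a minimal sketch of that reading. The terms and glosses come from the slide. The list-of-tuples layout and the helper names `hypernyms_of` and `render` are assumptions made for illustration, not the format of the Tamil Thesaurus.

```python
# Hypernym chain for the domain 'Construction', most general term first.
# Terms and glosses are from the slide; the data layout is an assumption.
CONSTRUCTION_CHAIN = [
    ("parumaippeyarkaL", "concrete nouns"),
    ("aHRinaippeyarkaL", "irrational nouns"),
    ("uyirillaatavai", "non-living beings"),
    ("uruvaakkiya maRRum patananjceyta poruTkaL", "manufactured and processed items"),
    ("kaTTappaTTavai", "constructed"),
]


def hypernyms_of(term, chain=CONSTRUCTION_CHAIN):
    """Return every term above `term` in the chain, nearest first."""
    names = [name for name, _ in chain]
    index = names.index(term)  # raises ValueError for an unknown term
    return list(reversed(names[:index]))


def render(chain=CONSTRUCTION_CHAIN):
    """Render the chain as an indented outline, one level per line."""
    lines = []
    for depth, (name, gloss) in enumerate(chain):
        lines.append(f"{'  ' * depth}{name} '{gloss}'")
    return "\n".join(lines)


print(render())
print(hypernyms_of("uyirillaatavai"))
```

A full thesaurus would need a tree rather than a single chain, since each level normally has several children.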
Nouns
Relations             Example
Synonymy              viiTu 'house' - illam 'house'
Hypernymy-Hyponymy    paLLi 'school' - kalviccaalai 'educational institution'
Hyponym-Hypernymy     kalluuri 'college' - aracukkalluuri 'govt college'
Holonymy-Meronymy     ndaaRkaali 'chair' - kaal 'leg'
Meronymy-Holonymy     cakkaram 'wheel' - vaNTi 'cart'
Related Verb          paTittal 'reading' - paTi 'read'
Coordinate terms      kooyil 'temple' - macuuti 'mosque'
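Most of the noun relations come in inverse pairs (hypernym/hyponym, holonym/meronym), so recording a link in one direction implies the reverse link. The following is a minimal sketch of that idea using the example words from the slide. The class `RelationStore`, the relation labels, and the convention that `add(source, relation, target)` means "target is the <relation> of source" are all assumptions of this sketch, not the Tamil WordNet's actual design. The Related Verb relation is left out for brevity.

```python
from collections import defaultdict

# Inverse pairs for the relations covered here (labels are illustrative).
INVERSE = {
    "hypernym": "hyponym",
    "hyponym": "hypernym",
    "holonym": "meronym",
    "meronym": "holonym",
    "synonym": "synonym",
    "coordinate": "coordinate",
}


class RelationStore:
    """Hypothetical in-memory store of word-to-word relations."""

    def __init__(self):
        # (word, relation) -> set of related words
        self._links = defaultdict(set)

    def add(self, source, relation, target):
        """Record `target` as the <relation> of `source`, plus the inverse link."""
        self._links[(source, relation)].add(target)
        self._links[(target, INVERSE[relation])].add(source)

    def get(self, word, relation):
        """Return the words standing in `relation` to `word`, sorted."""
        return sorted(self._links.get((word, relation), ()))


store = RelationStore()
store.add("viiTu", "synonym", "illam")               # house / house
store.add("paLLi", "hypernym", "kalviccaalai")       # school -> educational institution
store.add("kalluuri", "hyponym", "aracukkalluuri")   # college -> govt college
store.add("ndaaRkaali", "meronym", "kaal")           # chair has the part leg
store.add("cakkaram", "holonym", "vaNTi")            # wheel is part of cart
store.add("kooyil", "coordinate", "macuuti")         # temple / mosque

print(store.get("kaal", "holonym"))
print(store.get("kalviccaalai", "hyponym"))
```

Storing the inverse at insertion time keeps lookups symmetric without a second pass over the data.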
Verbs
Relations       Example
Synonym         paTi 'read' - payilu 'read'
Hypernymy       cuvai 'taste' - uNar
Troponymy       keeL 'ask' - kenjcu 'plead'
Nominal         paruku 'drink' - parukutal 'drinking'
Related Noun    kaNTupiTi 'discover' - kaNTupiTippu 'discovery'
Tamil WordNet
Objective: To build a WordNet for Tamil to
enhance machine translation
Resources: Tamil Thesaurus, Technical
Glossaries (Tamil University Publications),
Princeton English WordNet
Funding Agency: Tamil Software Development
Fund, Tamil Virtual University - 4 lacs
Time Frame: 18 months
Details
Software used
• Front-end – Java
• Back-end – MySQL Database
Project Deliverables
• 50k root words
• Relationships coded
• Stand-alone and web-based interface
• Embedded morphological analyser
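The slide says only that a database holds the words and the coded relationships. It gives no schema. As a rough sketch of how such a back-end could be laid out, the example below uses Python's built-in `sqlite3` module as a stand-in for MySQL. The table names `word` and `relation`, their columns, and the helpers `add_word` and `related` are hypothetical.

```python
import sqlite3

# In-memory SQLite stands in for the MySQL back-end; the schema is hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE word (
        id    INTEGER PRIMARY KEY,
        lemma TEXT NOT NULL UNIQUE,
        pos   TEXT NOT NULL
    );
    CREATE TABLE relation (
        source_id INTEGER NOT NULL REFERENCES word(id),
        target_id INTEGER NOT NULL REFERENCES word(id),
        kind      TEXT NOT NULL
    );
    """
)


def add_word(lemma, pos):
    """Insert a root word and return its row id."""
    cur = conn.execute("INSERT INTO word (lemma, pos) VALUES (?, ?)", (lemma, pos))
    return cur.lastrowid


def related(lemma, kind):
    """Return the lemmas linked to `lemma` by a relation of the given kind."""
    rows = conn.execute(
        """
        SELECT t.lemma
        FROM relation r
        JOIN word s ON s.id = r.source_id
        JOIN word t ON t.id = r.target_id
        WHERE s.lemma = ? AND r.kind = ?
        ORDER BY t.lemma
        """,
        (lemma, kind),
    )
    return [row[0] for row in rows]


house = add_word("viiTu", "noun")
home = add_word("illam", "noun")
conn.execute("INSERT INTO relation VALUES (?, ?, ?)", (house, home, "synonym"))

print(related("viiTu", "synonym"))
```

Keeping relations in their own table lets new relation types be added as rows rather than as schema changes.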
Statistics
Total Words: 50497
Unique Senses: 41013
Nouns: 46710
Verbs: 2881
Adjectives: 416
Adverbs: 490
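The four part-of-speech counts add up exactly to the total word count, and nouns account for roughly 92.5% of the entries. The short check below reproduces that arithmetic from the figures on the slide. The variable names are illustrative only.

```python
# Figures copied from the Statistics slide.
total_words = 50497
unique_senses = 41013
by_pos = {"Nouns": 46710, "Verbs": 2881, "Adjectives": 416, "Adverbs": 490}

# The part-of-speech counts should account for every word.
assert sum(by_pos.values()) == total_words

# Share of each part of speech in the total.
for pos, count in by_pos.items():
    print(f"{pos}: {count} ({count / total_words:.1%})")

# Average number of words per unique sense.
print(f"Words per sense: {total_words / unique_senses:.2f}")
```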
[Bar chart: Total Words, Unique Senses (Tokens), Nouns, Verbs, Adjectives and Adverbs; vertical axis from 0 to 50000]
Project Completed (2004)
http://www.nrcfosshelpline.in/code/wiki/TamilWordnet
Standalone version – Tamil WordNet (Snapshot)
Web-version – Tamil WordNet (Snapshot)
First Effort on Dravidian Languages
• National Workshop on WordNet for Dravidian
Languages
• 2-3 June 2003
• Organized by AU-KBC Research Centre,
Chennai, Central Institute of Indian
Languages, Mysore and Tamil University.
• Hands-on experience on specified domain –
construction
• Report available on Global WordNet website
MHRD Project
Creation of Machine Translation tools and resources
for English to Dravidian Languages: Pilot Study
• Aim: to develop a Machine Translation (MT) system and the
linguistic resources needed for English-Dravidian language
pairs (Tamil, Malayalam, Telugu and Kannada).
• This would facilitate the creation of rich educational content
in Indian languages.
• All the tools and the translation system are to be based on
Machine Learning methodologies, so that computer science
graduates and other non-linguists can immediately participate
in the national mission on literacy by contributing additional
tools for language translation.
Modules
• Module 1: Machine Translation
• aims at developing teaching material for the tools developed,
so that it can be delivered as part of the undergraduate
computer science and engineering curriculum on data
mining/machine learning.
• This will ensure the critical mass of manpower required to
sustain the translation effort needed for the national mission on
education.
• Module 2: Training
• aims at training 500 faculty members selected from across the
country in machine translation methodologies that use machine
learning techniques.
• Module 3: Dravidian WordNet
• aims at developing a Dravidian WordNet required for translation.
Total Budget
• IIT Bombay – 15 lacs
• Amrita University – 40 lacs
• Tamil University – 15 lacs
• University of Hyderabad – 15 lacs
• Dravidian University – 15 lacs
• Time Frame
• 12 months
• March 30, 2009 – March 29, 2010
Work done
• Part of a one-year pilot project involving
Tamil, Telugu, Malayalam and Kannada
• Funding Agency: Ministry of HRD
• Duration: 18 months (July 2009-Dec 2010)
• Deliverable: 13k synsets
• 7k synsets linked to IndoWordNet,
available at
http://www.cfilt.iitb.ac.in/wordnet/webhwn/wn.php
Statistics on Dravidian WordNet
Publications
• 'Tamil WordNet', Proceedings of the Fifth Global WordNet Conference, IIT-Bombay, 31 Jan-4 Feb 2010 (S. Rajendran)
• 'Building a WordNet for Dravidian Languages', Proceedings of the Fifth Global WordNet Conference, IIT-Bombay, 31 Jan-4 Feb 2010 (S. Rajendran, S. Gopakumar, V. Dhanalakshmi)
• 'Representation of Kinship in WordNet', Proceedings of the 9th International Tamil Internet Conference, Coimbatore, 23-27 June 2010 (S. Arulmozi)
• 'Polysemy in Tamil and other Indian Languages', Proceedings of the Fifth Global WordNet Conference, IIT-Bombay, 31 Jan-4 Feb 2010 (S. Arulmozi & Panchanan Mohanty)
• 'Telugu WordNet', Proceedings of the Fifth Global WordNet Conference, IIT-Bombay, 31 Jan-4 Feb 2010 (S. Arulmozi)
First IndoWordNet Workshop
• Amrita University
• 11-14 June 2009
• The necessity of developing linked WordNets for the different
languages of India was stressed
• Challenges such as language divergence, lexical semantics, and
embedding WordNet in MT and cross-lingual search applications
can be addressed
• Participation from groups: Hindi, Marathi, Sanskrit, Nepali,
Assamese, Bodo, Manipuri, Konkani, Kashmiri, Tamil,
Telugu, Malayalam, Kannada
• Proposal on Indhradhanush
Dravidian WordNet
• Present Project
• Funded by DIT.
Links
• Tamil WordNet – Open Source
http://www.nrcfosshelpline.in/code/wiki/TamilWordnet
• VerbNet (English)
http://verbs.colorado.edu/~mpalmer/projects/verbnet.html
• Princeton English WordNet
http://wordnet.princeton.edu/
• Global WordNet Association
http://www.globalwordnet.org/
• WordNets in the World
http://www.globalwordnet.org/gwa/wordnet_table.htm
• WordNet Bibliography
http://lit.csci.unt.edu/~wordnet/
• IndoWordNet
http://www.cfilt.iitb.ac.in/wordnet/webhwn/wn.php
Thank you!