Tools and Interfaces for Wordnet Construction, Linking and Maintenance
Abhishek G. Nanda
03005031
Under the guidance of:
Prof. Pushpak Bhattacharyya
Wordnet
Language - Means of communication
using encoded information
Words - Units used for communicating
information
Semantics - Meanings of words and
word forms
Wordnet
Dictionary - List of alphabetically
arranged words with meanings
Thesaurus - List of alphabetically
arranged concepts with word forms
What is Wordnet?
Wordnet
Lexical database of words
Arranged based on concepts
Grouped based on synonymy
Synonymy - Property of different words sharing the same meaning in a context. Eg. buy and purchase
Polysemy - Property of a word having different meanings in different contexts. Eg. bank as a financial institution and as a river bank
Wordnet - Lexical Matrix
Word Meanings   Word Forms
                F1              F2            F3            …                   Fn
M1              E1,1 (depend)   E1,2 (bank)   E1,3 (rely)
M2                              E2,2 (bank)                 E2,… (embankment)
M3                              E3,2 (bank)   E3,3
…                                                           …
Mm                                                                              Em,n
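The lexical matrix can be sketched as a mapping from meanings (rows) to the word forms (columns) that express them. The sketch below is only an illustration: the class name `LexicalMatrix`, its methods, and the assignment of "embankment" to row M2 are assumptions made for the example, not part of the tools described in this deck.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of a lexical matrix: rows are meanings, columns are word forms.
class LexicalMatrix {
    // meaning id -> word forms that can express it (one row of the matrix)
    private final Map<String, Set<String>> formsByMeaning = new LinkedHashMap<>();

    // Record one matrix entry E(meaning, form).
    void addEntry(String meaning, String form) {
        formsByMeaning.computeIfAbsent(meaning, k -> new LinkedHashSet<>()).add(form);
    }

    // Synonymy: the other word forms that share the given row.
    Set<String> synonyms(String meaning, String form) {
        Set<String> row = new LinkedHashSet<>(
                formsByMeaning.getOrDefault(meaning, Collections.<String>emptySet()));
        row.remove(form);
        return row;
    }

    // Polysemy: every row in which the form's column has an entry.
    List<String> meaningsOf(String form) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Set<String>> e : formsByMeaning.entrySet()) {
            if (e.getValue().contains(form)) {
                result.add(e.getKey());
            }
        }
        return result;
    }

    // Entries taken from the example words on the slide (row assignment assumed).
    static LexicalMatrix example() {
        LexicalMatrix m = new LexicalMatrix();
        m.addEntry("M1", "depend");
        m.addEntry("M1", "bank");
        m.addEntry("M1", "rely");
        m.addEntry("M2", "bank");
        m.addEntry("M2", "embankment");
        m.addEntry("M3", "bank");
        return m;
    }
}
```

Reading along a row gives a synset; reading down the "bank" column gives the polysemous senses of that form.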
Wordnet - Relations
Semantic Relations
Hypernymy and Hyponymy
Meronymy and Holonymy
Entailment
Troponymy
Coordinate terms
Lexical Relations
Antonymy
Gradation
Wordnet - Relations
Hypernymy and Hyponymy
is a kind of
leaf is the hypernym of neem leaf
neem leaf is the hyponym of leaf
Meronymy and Holonymy
part-whole
root is the meronym of tree
tree is the holonym of root
Wordnet - Relations
Entailment
implication
snore entails sleep
Troponymy
manner elaboration
roar is the troponym of speak
Coordinate terms
Common hypernym
wolf and dog are coordinate terms
Wordnet - Relations
Antonymy
opposites
fat is the antonym of thin
Gradation
Intermediate concepts in antonymy
morning -> noon -> evening
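Hyponymy and coordinate terms can be derived from stored hypernym links alone. The following is a minimal sketch under assumptions: the class `RelationSketch` is hypothetical, each word is given a single hypernym, and the hypernym "canine" for wolf and dog is invented for the example (the slide only states that the two are coordinate terms).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: semantic relations derived from "is a kind of" links.
class RelationSketch {
    // hyponym -> its hypernym (single parent assumed for simplicity)
    private final Map<String, String> hypernymOf = new HashMap<>();

    void addHypernym(String hyponym, String hypernym) {
        hypernymOf.put(hyponym, hypernym);
    }

    // Hyponymy is the inverse of hypernymy.
    List<String> hyponymsOf(String hypernym) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, String> e : hypernymOf.entrySet()) {
            if (e.getValue().equals(hypernym)) {
                result.add(e.getKey());
            }
        }
        Collections.sort(result);
        return result;
    }

    // Coordinate terms: two different words with a common hypernym.
    boolean areCoordinate(String a, String b) {
        String parent = hypernymOf.get(a);
        return !a.equals(b) && parent != null && parent.equals(hypernymOf.get(b));
    }

    static RelationSketch example() {
        RelationSketch r = new RelationSketch();
        r.addHypernym("neem leaf", "leaf");
        r.addHypernym("wolf", "canine"); // "canine" is an assumed hypernym
        r.addHypernym("dog", "canine");
        return r;
    }
}
```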
Wordnet - Wordnets
PWN - Princeton WordNet for English
language
EuroWordNet - Wordnet for European
languages
HWN - Hindi Wordnet for Hindi
language
Hindi Wordnet
Relations borrowed - synonymy, hypernymy, holonymy, troponymy, entailment, etc.
Defines 8 part-whole relationships
Defines 3 types of antonymy relations
Gradable antonym (गर्म-ठंडा)
Complementary antonym (जीवित-मृत)
Converse antonym (लेना-देना)
Hindi Wordnet
Gradation
Intermediate terms
• Pre-Intermediate terms
• Post-Intermediate terms
Eg. सूखा - शुष्क - नम - तर - गीला
10 domains of interpretation. Eg.
State, Size, Gender, etc.
Hindi Wordnet - Verbs
Simple Verb - One root. Eg. खाना
Compound Verb - Made up of another POS. Eg. मीठा लगना
Combination Verb - Made of two related verbs. Eg. पढ़ना-लिखना
Onomatopoeic Verb - Eg. खटखटाना from खटखट
Conjunct Verb - Hidden sense of action. Eg. ले जाना
Hindi Wordnet - Verbs
Causative verbs
First causative verb - Eg. सुलाना (to make somebody sleep)
Second causative verb - Eg. सुलवाना (to make somebody sleep through the effort of a third person)
Hindi Wordnet - Creation
Principles for Wordnet creation
Minimality - Minimal set. Eg. {घर, कमरा, कक्ष}
Coverage - Coverage of words. Eg. {घर, कमरा, कक्ष}
Replaceability - Mutual replaceability in a context. Eg. अमेरिका में दो साल बिताने के बाद श्याम स्वदेश/घर लौटा (After spending two years in America, Shyam returned to his homeland/home)
Sanskrit Wordnet
Concept-based Multilingual dictionary
Need
Loss of synonymy when moving across languages. Eg. dark and evil are synonymous in English but their counterparts अंधेरा and दुष्ट are not.
Number of lexicographers required - O(n^2) for n languages
Sanskrit Wordnet - Concept-based Multilingual dictionary
Columns: Concepts (Concept ID: Concept description), L1 (English) (W1, W2, W3, ..), L2 (Hindi) (W4, W5, W6, ..), L3 (Sanskrit) (W7, W8, W9, ..)
4066: any of various long-tailed primates (excluding the prosimians)
L1 (English): (monkey)
L2 (Hindi): (बंदर, बन्दर, वानर, बानर, कीश, कपि, मर्कट, ..)
L3 (Sanskrit): (वानरः, कपिः, प्लवङ्गः, प्लवगः, शाखामृगः, वलीमुखः, मर्कटः, ..)
2186: a typical star that is the source of light and heat for the planets in the solar system
L1 (English): (sun)
L2 (Hindi): (सूर्य, सूरज, भानु, दिवाकर, भास्कर, प्रभाकर, दिनकर, रवि, ..)
L3 (Sanskrit): (सूर्यः, सविता, आदित्यः, मित्रः, अरुणः, भानुः, पूषा, अर्कः, ..)
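Because every language attaches its synset to the same concept ID, a translation lookup reduces to two map accesses and no language pair needs its own dictionary. The sketch below illustrates the idea only: `ConceptDictionary`, the language codes "en", "hi", "sa", and the romanized words are assumptions made for the example, not the format used by the actual tool.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of a concept-based multilingual dictionary.
class ConceptDictionary {
    // concept id -> (language code -> synset words)
    private final Map<Integer, Map<String, List<String>>> synsets = new TreeMap<>();

    void addSynset(int conceptId, String language, String... words) {
        synsets.computeIfAbsent(conceptId, k -> new HashMap<>())
               .put(language, Arrays.asList(words));
    }

    // All target-language words of every concept whose source synset contains the word.
    List<String> translate(String word, String from, String to) {
        List<String> result = new ArrayList<>();
        for (Map<String, List<String>> byLanguage : synsets.values()) {
            List<String> source = byLanguage.get(from);
            List<String> target = byLanguage.get(to);
            if (source != null && target != null && source.contains(word)) {
                result.addAll(target);
            }
        }
        return result;
    }

    // Concept IDs from the slide; words are romanized and abridged for the sketch.
    static ConceptDictionary example() {
        ConceptDictionary d = new ConceptDictionary();
        d.addSynset(4066, "en", "monkey");
        d.addSynset(4066, "hi", "bandar", "vanar");
        d.addSynset(4066, "sa", "vanarah", "kapih");
        d.addSynset(2186, "en", "sun");
        d.addSynset(2186, "sa", "suryah", "savita");
        return d;
    }
}
```

Adding a new language means adding one more synset per concept, which is why the effort grows linearly with the number of languages in this design.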
Sanskrit Wordnet Challenges
Observed during construction of Marathi
Wordnet:
Single word to synthetic expression. Eg. bankrupt -> दिवाला निकालना
Culture specific concepts. Eg. girlfriend. Requires transliteration such as महिलामित्र
Splitting of concepts. Eg. फ़ीका (tasteless) in Hindi -> अगोड (less sweet), अळणी (less salty), मिळमिळीत (less spicy) in Marathi
Sanskrit Wordnet Challenges
Observed during Indo Wordnet workshop at
Coimbatore, June 2009:
Varied usage across regions and people. Eg. In Kashmiri, there are separate words for drinking water and water in the Muslim community but one word in the Hindu community.
Single-word and multi-word expressions in the same language. Eg. In Nepali, मोह and मोहमाया both mean infatuation.
Sanskrit Wordnet - Sanskrit
Indo-Aryan language
Hinduism
Buddhism
Classical Sanskrit - Panini
Vedic Sanskrit - pre-Classical
Sanskrit Wordnet - Sanskrit
Etymology
Etymology of Verbs
गण - Ten classes based on how the stem is generated
इट् - Three groups based on position of tense marker
उपसर्ग - 22 prepositional particles that modify a root
Synset Marking
Grouping of synsets based on
frequency of occurrence and usage in
language
Universal concepts
who and what
honesty
SynsetMarker - Interface
SynsetMarker - Features
Display of synset fields
Browsing
Search
Word
ID
Marking - Universal, Common, Common in
Hindi and Uncommon
Save/Exit
Shortcuts
SynsetMarker - API
records
DefineRecord
SynsetRecord
operations
SynsetOperator
RecordReader
RecordWriter
gui
Interface
SynsetMarker - Process
First round divided among 6 people
31000 synsets marked
Universal and Common (clubbed) - 15234 synsets
Common in Hindi - 6771 synsets
Uncommon - 10987 synsets
Second round - voting scheme
Common - 13205 synsets
Core Synset Selection
Bharatiya Vyavahara Kosh
English and 15 Indian languages
2000 concepts with domains
खेल (game), प्राणी (animal), फल (fruit)
Link synsets to words in Kosh
Polysemy
• अनन्नास as pineapple fruit
• अनन्नास as pineapple plant
DomainClassifier - Interface
DomainClassifier - Features
Display of synset fields
Browsing through records
Marking right synset for a word and a
domain
Save/Export
DomainClassifier - API
records
DefineRecord
SynsetRecord
operations
SynsetOperator
RecordReader
RecordWriter
gui
Interface
DomainClassifier - Process
Groupings
Single IDs
Multiple IDs
No IDs
Rounds of marking
Common synsets
Common in Hindi synsets
Uncommon synsets
DomainClassifier - Process
End of process
Core - 1969 synsets
Common - 11658 synsets
Online SynsetMarker Interface
Online SynsetMarker - API
Written in PHP
login.php - Interface to login as a user or as an admin or to
register as a new user
process.php - To process login/register data and
accordingly direct a user
logout.php - To logout a user
mainprocess.php - Processing of data to display unmarked
synset
main.php - Display of synset with buttons to mark as
Common or Uncommon
admin.php - Admin page with statistical data of number of
marked synsets per user and number of users based on
synset marks
adminpassword.php - Password interface to login as admin
adminuserprofile.php - Profile data of a particular user
Online SynsetMarker Process
Threshold for dropping synset as
Uncommon
Had to be set to 1
Common - 10312 synsets
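The thresholding step can be sketched as follows. This is one possible reading, stated as an assumption: a synset is dropped once the number of users who marked it Uncommon reaches the threshold, so a threshold of 1 drops a synset on a single Uncommon mark. The class and method names are hypothetical and the actual tool is written in PHP; Java is used here only for consistency with the other sketches.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of dropping synsets by an "Uncommon" mark threshold.
class UncommonThreshold {
    // uncommonMarks: synset ID -> number of users who marked it Uncommon.
    // Returns the IDs that stay Common, in ascending order.
    static List<Integer> keepCommon(Map<Integer, Integer> uncommonMarks, int threshold) {
        List<Integer> kept = new ArrayList<>();
        for (Map.Entry<Integer, Integer> e : new TreeMap<>(uncommonMarks).entrySet()) {
            if (e.getValue() < threshold) { // fewer marks than the threshold: keep
                kept.add(e.getKey());
            }
        }
        return kept;
    }
}
```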
Sanskrit Wordnet Interface
Interface for creation of Sanskrit
Wordnet
Based on idea of Concept-based
Multilingual dictionary
User Interface - Configure
User Interface - Main
User Interface - Panels
Help Panel: Buttons for Commenting,
Synchronizing and References tool.
Search Panel: Search word or ID or perform
advanced search. Font increase/decrease.
Synset Panels: Synset data fields and
completion status.
Tool Panel: English synset, Link tool,
Etymology tool.
Browse Panel: Browsing through records,
saving and exiting.
User Interface - Features Reference tool
User Interface - Features Synchronize tool
User Interface - Features Advanced Search
User Interface - Features English synsets tool
User Interface - Features Link tool
User Interface - Features Etymology tool
User Interface - Features Keyboard Shortcuts
Undo feature - Monitor keyboard
actions and undo on Ctrl-Z
Saving feature - Monitor change in
field values and save on Ctrl-S
Search - Ctrl-F for quick search
access
Interface API
Problems and Requirements
Huge volumes of data (eg. 30,000
synsets)
Links between different data
Efficient and user-friendly GUI
Sufficient querying
• Grouping
• Review separation
Interface API
Graphical User Interface
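import javax.swing.JButton;

private JButton saveButton = null;

// Lazily create the Save button the first time it is requested.
public JButton getSaveButton() {
    if (saveButton == null) {
        saveButton = new JButton();
    }
    return saveButton;
}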
Graphical User Interface
Graphical User Interface Panels
Graphical User Interface
Panels
Components (within Panels)
Hierarchical structure
Classes JButton, JTextField, JCheckBox,
etc.
Listeners
ActionListener - actions performed by user
KeyListener - key strokes (undo, search) and
shortcuts
Synset
Synset ID: a unique number identifying a
synset
Category: POS category of the words
Concept: The part of the gloss that gives a
brief summary of what the synset
represents
Example: One or more examples of the
words in the synset being used in
sentences
Synset: The set of synonymous words
comprised in the synset
Synset - DSF format
ID :: 121
CATEGORY :: NOUN
CONCEPT :: अपने से छोटों के प्रति हृदय में उठनेवाला प्रेम
EXAMPLE :: “चाचा नेहरू को बच्चों से बहुत ही स्नेह था”
SYNSET :: स्नेह,नेह,लगाव,ममता
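Reading this format amounts to splitting each line at the first `::`. The sketch below is a minimal illustration, assuming one field per line and UTF-8 text already loaded into a string; `DsfParser` and its methods are hypothetical names, not the tool's RecordReader.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of a reader for "KEY :: value" synset records.
class DsfParser {
    // Parse one record into an ordered field map (field name -> raw value).
    static Map<String, String> parse(String text) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (String line : text.split("\n")) {
            int sep = line.indexOf("::");
            if (sep < 0) {
                continue; // not a field line
            }
            fields.put(line.substring(0, sep).trim(), line.substring(sep + 2).trim());
        }
        return fields;
    }

    // Split the comma-separated SYNSET field into individual words.
    static String[] synsetWords(Map<String, String> fields) {
        String synset = fields.getOrDefault("SYNSET", "");
        return synset.isEmpty() ? new String[0] : synset.split("\\s*,\\s*");
    }
}
```

For example, parsing the three lines `ID :: 121`, `CATEGORY :: NOUN` and `SYNSET :: sneh,neh,lagaav,mamta` (romanized here for the sketch) yields the ID "121" and four synset words.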
Data structure SynsetRecord
Class SynsetRecord
Strings to hold field values
Functions:
equals(otherObject)
isBetterThan(otherObject)
isComplete()
…
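A record class along these lines might look like the sketch below. The slide does not say how `isBetterThan` ranks records, so the rule used here (more non-empty fields wins) is an assumption, as are the class name `SynsetRecordSketch` and the field names.

```java
import java.util.Objects;

// Hypothetical sketch of a synset record holding its fields as strings.
class SynsetRecordSketch {
    final String id, category, concept, example, synset;

    SynsetRecordSketch(String id, String category, String concept,
                       String example, String synset) {
        this.id = id;
        this.category = category;
        this.concept = concept;
        this.example = example;
        this.synset = synset;
    }

    // Number of fields that carry a non-blank value.
    private int filledCount() {
        int n = 0;
        for (String f : new String[] {id, category, concept, example, synset}) {
            if (f != null && !f.trim().isEmpty()) {
                n++;
            }
        }
        return n;
    }

    // Complete when every field has a value.
    boolean isComplete() {
        return filledCount() == 5;
    }

    // Assumed ranking: the record with more filled fields is better.
    boolean isBetterThan(SynsetRecordSketch other) {
        return filledCount() > other.filledCount();
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof SynsetRecordSketch)) {
            return false;
        }
        SynsetRecordSketch r = (SynsetRecordSketch) o;
        return Objects.equals(id, r.id) && Objects.equals(category, r.category)
                && Objects.equals(concept, r.concept) && Objects.equals(example, r.example)
                && Objects.equals(synset, r.synset);
    }

    @Override
    public int hashCode() {
        return Objects.hash(id, category, concept, example, synset);
    }
}
```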
Data structure DefineRecord
“define-end” language
Example (description of a book about cricket):
define book sixer
  length :: 700
  topic :: cricket
  define chapter 1
    length :: 300
    topic :: batting
  end
  define chapter 2
    length :: 400
    topic :: bowling :: scientific
  end
end
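Since each `define` is closed by its own `end`, the language can be read with a small recursive parser. The following is a sketch under assumptions: the header is taken to be `define <type> <name>`, parameters are `name :: value [:: value ...]`, and the class `DefineNode` is a hypothetical stand-in for the tool's own data structure.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a recursive parser for the define-end language.
class DefineNode {
    final String type;
    final String name;
    final Map<String, List<String>> params = new LinkedHashMap<>();
    final List<DefineNode> children = new ArrayList<>();

    DefineNode(String type, String name) {
        this.type = type;
        this.name = name;
    }

    // Parse the first define block found in the text.
    static DefineNode parse(String text) {
        Iterator<String> lines = Arrays.asList(text.split("\n")).iterator();
        while (lines.hasNext()) {
            String line = lines.next().trim();
            if (line.startsWith("define ")) {
                return parseBody(line, lines);
            }
        }
        throw new IllegalArgumentException("no define block found");
    }

    // Consume lines up to the matching "end", recursing on nested defines.
    private static DefineNode parseBody(String header, Iterator<String> lines) {
        String[] parts = header.split("\\s+", 3); // "define", type, name
        DefineNode node = new DefineNode(parts[1], parts.length > 2 ? parts[2] : "");
        while (lines.hasNext()) {
            String line = lines.next().trim();
            if (line.equals("end")) {
                return node;
            }
            if (line.startsWith("define ")) {
                node.children.add(parseBody(line, lines));
            } else if (!line.isEmpty()) {
                String[] f = line.split("\\s*::\\s*");
                node.params.put(f[0], Arrays.asList(f).subList(1, f.length));
            }
        }
        throw new IllegalArgumentException("missing 'end' for: " + header);
    }
}
```

Applied to the cricket book example, the root node has type "book", name "sixer", two parameters and two nested chapter nodes; the second chapter's `topic` carries two values.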
Data structure DefineRecord
Example (etymology format):
define etymformat verb
  इट् :: dropdown :: word :: सेट्, अनिट्, वेट्
  पद :: dropdown :: word :: आत्मनेपद, परस्मैपद, उभयपद
  कर्मवत्त्व :: dropdown :: synset :: सकर्मक, अकर्मक
  कृत् रूप :: textfield :: word
  उपसर्ग :: dropdown :: word :: प्र, परा, अप, सम्, अनु, अव, निस्, निर्, दुस्, दुर्, वि, अधि, अपि, परि, नि, आ, प्रति, उप, सु, उत्, अभि, अति
  साधित धातु :: dropdown :: word :: णिच्, सन्, यङ्, यङ्लुक्, नामधातु
end
Data structure DefineRecord
Example (etymology data for synset ID 1476):
define etymology 1476
  कर्मवत्त्व :: अकर्मक
  finished :: true
  define word क्षि
    इट् :: सेट्
    पद :: परस्मैपद
    स्वर ::
    कृत् रूप :: क्षयः
    उपसर्ग :: अप
    साधित धातु ::
  end
end
Data structure DefineRecord
Data structure to hold parametric and
nested data
Functions:
addField(objectToAdd) - Function to add a
parameter or a nested instance of
DefineRecord
toString() - Function to export a record in the
define-end language
getParameterField(parameterName) - Function to return a specific parameter field
…
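The three functions listed above could be arranged roughly as in this sketch. It is illustrative only: the class name `DefineRecordSketch`, the use of two `addField` overloads (one for a parameter, one for a nested record), and the choice to export parameters before nested records are assumptions, not the tool's actual implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a record holding parameters and nested records.
class DefineRecordSketch {
    private final String type;
    private final String name;
    private final Map<String, List<String>> parameters = new LinkedHashMap<>();
    private final List<DefineRecordSketch> nested = new ArrayList<>();

    DefineRecordSketch(String type, String name) {
        this.type = type;
        this.name = name;
    }

    // Add a parameter with one or more values.
    void addField(String parameterName, String... values) {
        parameters.put(parameterName, Arrays.asList(values));
    }

    // Add a nested record.
    void addField(DefineRecordSketch child) {
        nested.add(child);
    }

    // Values of a parameter, or null if it was never set.
    List<String> getParameterField(String parameterName) {
        return parameters.get(parameterName);
    }

    // Export in the define-end language: header, parameters, nested records, "end".
    @Override
    public String toString() {
        StringBuilder sb = new StringBuilder();
        sb.append("define ").append(type).append(' ').append(name).append('\n');
        for (Map.Entry<String, List<String>> p : parameters.entrySet()) {
            sb.append(p.getKey()).append(" :: ")
              .append(String.join(" :: ", p.getValue())).append('\n');
        }
        for (DefineRecordSketch child : nested) {
            sb.append(child).append('\n');
        }
        return sb.append("end").toString();
    }
}
```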
Data Operations
Data Operations - File I/O
Unicode text data manipulation - UTF-8 format
Classes for file parsing/writing:
RecordWriter
RecordReader
Data Operations - File I/O
RecordReader
SynsetRecord parser
DefineRecord parser
String converters
RecordWriter
SynsetRecord parser
DefineRecord parser
Data Operations RecordModel Interface
Model to create mechanism for
working with a new data structure
Handles parsing, writing, querying and
ID retrieval
Models written as Classes:
SynsetRecordModel
• EnglishSynsetRecordModel
AbstractDefineRecordModel
Data Operations RecordModel Interface
int getRecordId(E record): Function to return the
record ID of a record
boolean isBetterThan(E a, E b): Function to return
whether record a weighs better than record b
boolean isFinished(E a): Function to return whether a
record can be set as completed
E mergeRecords(E a, E b): Function to merge the data
of two separate records into one
boolean searchWord(String word, E a): Function to
perform a query (defined in String word) on a record
E parseRecord(RecordReader fileHandle): Function to
parse a record from a file
void writeRecord(RecordWriter fileHandle, E a):
Function to write a record into a file
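As a generic Java interface, the model could be sketched as below. The file-related methods are left out because they depend on the reader and writer classes, which are not shown here; `RecordModelSketch`, `WordEntry`, `EntryModel` and the ranking and merge rules inside them are assumptions made for the illustration.

```java
// Hypothetical sketch of the record model as a generic interface (file I/O omitted).
interface RecordModelSketch<E> {
    int getRecordId(E record);
    boolean isBetterThan(E a, E b);
    boolean isFinished(E a);
    E mergeRecords(E a, E b);
    boolean searchWord(String word, E a);
}

// A toy record type used only to exercise the interface.
class WordEntry {
    final int id;
    final String words;
    final boolean finished;

    WordEntry(int id, String words, boolean finished) {
        this.id = id;
        this.words = words;
        this.finished = finished;
    }
}

// One possible model: finished records beat unfinished ones, then longer word lists win.
class EntryModel implements RecordModelSketch<WordEntry> {
    public int getRecordId(WordEntry record) {
        return record.id;
    }

    public boolean isBetterThan(WordEntry a, WordEntry b) {
        if (a.finished != b.finished) {
            return a.finished;
        }
        return a.words.length() > b.words.length();
    }

    public boolean isFinished(WordEntry a) {
        return a.finished;
    }

    // Assumed merge rule: keep whichever record is better, preferring a on ties.
    public WordEntry mergeRecords(WordEntry a, WordEntry b) {
        return isBetterThan(b, a) ? b : a;
    }

    public boolean searchWord(String word, WordEntry a) {
        return a.words.contains(word);
    }
}
```

Writing a new model class in this style is what lets the same operator code work with a new data structure.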
Data Operations RecordOperator Class
Operator to provide functionality to
work with records of data
Load, Browse, Update, Search,
Synchronize and Write
Two kinds at the GUI level:
Parent Operator
Linker Operator
Data Operations RecordOperator Class
Functions for each data type (depending on the
corresponding RecordModel):
Constructors for ParentOperator and LinkerOperator
getRecord() - Function to obtain the current record
setCurrentId() and getCurrentId() - Functions to set
and obtain ID to work with
getFirstId(), getPreviousId(), getNextId() and
getLastId() - Functions to browse through records
isFinished() and isAllFinished() - Functions to obtain
completion status of records
searchRecords() and advancedSearch() - Functions to
perform search operations on the records
…
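The browsing functions map naturally onto a sorted map keyed by record ID. The sketch below assumes records are plain strings and that browsing stops at the first and last record instead of wrapping around; `RecordOperatorSketch` and its `load` method are hypothetical, and only a subset of the listed functions is shown.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical sketch of an operator that browses and searches records by ID.
class RecordOperatorSketch {
    private final NavigableMap<Integer, String> records = new TreeMap<>();
    private int currentId;

    void load(int id, String record) {
        records.put(id, record);
    }

    void setCurrentId(int id) {
        if (!records.containsKey(id)) {
            throw new IllegalArgumentException("unknown record ID: " + id);
        }
        currentId = id;
    }

    int getCurrentId() {
        return currentId;
    }

    String getRecord() {
        return records.get(currentId);
    }

    int getFirstId() {
        return records.firstKey();
    }

    int getLastId() {
        return records.lastKey();
    }

    // Next higher ID, or the current ID when already at the last record.
    int getNextId() {
        Integer next = records.higherKey(currentId);
        return next != null ? next : currentId;
    }

    // Next lower ID, or the current ID when already at the first record.
    int getPreviousId() {
        Integer previous = records.lowerKey(currentId);
        return previous != null ? previous : currentId;
    }

    // IDs of all records whose text contains the word, in ascending order.
    List<Integer> searchRecords(String word) {
        List<Integer> hits = new ArrayList<>();
        for (Map.Entry<Integer, String> e : records.entrySet()) {
            if (e.getValue().contains(word)) {
                hits.add(e.getKey());
            }
        }
        return hits;
    }
}
```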
API Overview
GUI defines one ParentOperator (eg.
source synsets)
GUI defines many LinkerOperators
(eg. target synsets, link data, etc.)
Models attached to the operators
Data repositories are defined
GUI browses, retrieves and
manipulates data using operators.
Version history
Future work
Tool to generate etymology format
GUI functionality to display synsets
from multiple languages
Advanced commenting based on
reviews and completion
Thank you