abhishek-sanskrit-wordnet-jul09

Tools and Interfaces for
Wordnet construction, linking
and maintenance
Abhishek G. Nanda
03005031
Under the guidance of:
Prof. Pushpak Bhattacharyya
Wordnet
Language - Means of communication using encoded information
Words - Units used for communicating information
Semantics - Meanings of words and word forms

Wordnet
Dictionary - List of alphabetically arranged words with meanings
Thesaurus - List of alphabetically arranged concepts with word forms

What is Wordnet?
Wordnet

Lexical database of words
Arranged based on concepts
Grouped based on synonymy
Synonymy - Property of different words sharing the same meaning in a context. Eg. buy and purchase
Polysemy - Property of a word having different meanings in different contexts. Eg. bank as a financial institution and as a river bank
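Wordnet - Lexical Matrix
Rows: Word Meanings (M1 .. Mm). Columns: Word Forms (F1 .. Fn).
Word Meanings | F1            | F2          | F3          | … | Fn
M1            | E1,1 (depend) | E1,2 (bank) | E1,3 (rely) |   |
M2            |               | E2,2 (bank) |             |   | E2,… (embankment)
M3            |               | E3,2 (bank) | E3,3        |   |
…             |               |             |             | … |
Mm            |               |             |             |   | Em,n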
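The matrix can be read as two lookups: from a word form to its meanings (polysemy) and from a meaning to its word forms (synonymy). A minimal Java sketch of that reading follows; the class name LexicalMatrix, its methods and the meaning IDs "M1" to "M3" are illustrative assumptions, not part of any Wordnet API.

```java
import java.util.*;

/** Toy lexical matrix: word forms mapped to the meanings they express. */
final class LexicalMatrix {
    // word form -> meaning IDs (hypothetical IDs such as "M1")
    private final Map<String, Set<String>> meaningsByForm = new TreeMap<>();
    // meaning ID -> word forms
    private final Map<String, Set<String>> formsByMeaning = new TreeMap<>();

    /** Records one filled cell of the matrix. */
    void addEntry(String meaning, String form) {
        meaningsByForm.computeIfAbsent(form, k -> new TreeSet<>()).add(meaning);
        formsByMeaning.computeIfAbsent(meaning, k -> new TreeSet<>()).add(form);
    }

    /** Forms that share at least one meaning with the given form. */
    Set<String> synonymsOf(String form) {
        Set<String> result = new TreeSet<>();
        Set<String> meanings = meaningsByForm.get(form);
        if (meanings != null) {
            for (String m : meanings) {
                result.addAll(formsByMeaning.get(m));
            }
        }
        result.remove(form);
        return result;
    }

    /** A form is polysemous when it appears in more than one row. */
    boolean isPolysemous(String form) {
        Set<String> meanings = meaningsByForm.get(form);
        return meanings != null && meanings.size() > 1;
    }

    /** The filled cells shown on the slide. */
    static LexicalMatrix sample() {
        LexicalMatrix m = new LexicalMatrix();
        m.addEntry("M1", "depend");
        m.addEntry("M1", "bank");
        m.addEntry("M1", "rely");
        m.addEntry("M2", "bank");
        m.addEntry("M2", "embankment");
        m.addEntry("M3", "bank");
        return m;
    }

    public static void main(String[] args) {
        LexicalMatrix m = sample();
        System.out.println("synonyms of rely: " + m.synonymsOf("rely"));
        System.out.println("bank polysemous: " + m.isPolysemous("bank"));
    }
}
```

Keeping both maps in step makes either direction a single lookup, at the cost of storing each cell twice.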
Wordnet - Relations

Semantic Relations
Hypernymy and Hyponymy
 Meronymy and Holonymy
 Entailment
 Troponymy
 Coordinate terms


Lexical Relations
Antonymy
 Gradation

Wordnet - Relations

Hypernymy and Hyponymy
is a kind of
 leaf is the hypernym of neem leaf
 neem leaf is the hyponym of leaf


Meronymy and Holonymy
part-whole
 root is the meronym of tree
 tree is the holonym of root

Wordnet - Relations

Entailment
implication
 snore entails sleep


Troponymy
manner elaboration
 roar is the troponym of speak


Coordinate terms
Common hypernym
 wolf and dog are coordinate terms

Wordnet - Relations

Antonymy
opposites
 fat is the antonym of thin


Gradation
Intermediate concepts in antonymy
 morning -> noon -> evening
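Hypernymy/hyponymy are inverse relations, and coordinate terms follow from a shared hypernym. The Java sketch below illustrates this under simplifying assumptions: each word has a single hypernym, the class RelationSketch is hypothetical, and "canine" is an assumed common hypernym for wolf and dog that the slides do not name.

```java
import java.util.*;

/** Toy store for "is a kind of" links; not the Wordnet API. */
final class RelationSketch {
    // hyponym -> hypernym; one parent per word keeps the sketch small
    private final Map<String, String> hypernymOf = new HashMap<>();

    void addIsA(String hyponym, String hypernym) {
        hypernymOf.put(hyponym, hypernym);
    }

    String hypernym(String word) {
        return hypernymOf.get(word);
    }

    /** Hyponymy is the inverse of hypernymy, so it is derived rather than stored. */
    Set<String> hyponyms(String word) {
        Set<String> result = new TreeSet<>();
        for (Map.Entry<String, String> e : hypernymOf.entrySet()) {
            if (e.getValue().equals(word)) {
                result.add(e.getKey());
            }
        }
        return result;
    }

    /** Two distinct words are coordinate terms when they share a hypernym. */
    boolean areCoordinate(String a, String b) {
        String parent = hypernymOf.get(a);
        return !a.equals(b) && parent != null && parent.equals(hypernymOf.get(b));
    }

    static RelationSketch sample() {
        RelationSketch r = new RelationSketch();
        r.addIsA("neem leaf", "leaf");
        r.addIsA("wolf", "canine"); // "canine" is an assumed common hypernym
        r.addIsA("dog", "canine");
        return r;
    }

    public static void main(String[] args) {
        RelationSketch r = sample();
        System.out.println(r.hypernym("neem leaf"));
        System.out.println(r.hyponyms("canine"));
        System.out.println(r.areCoordinate("wolf", "dog"));
    }
}
```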

Wordnet - Wordnets
PWN - Princeton WordNet for English
language
 EuroWordNet - Wordnet for European
languages
 HWN - Hindi Wordnet for Hindi
language

Hindi Wordnet
Relations borrowed - synonymy, hypernymy, holonymy, troponymy, entailment, etc.
Defines 8 part-whole relationships
Defines 3 types of antonymy relations
Gradable antonym (गर्म-ठंडा, hot-cold)
Complementary antonym (जीवित-मृत, alive-dead)
Converse antonym (लेना-देना, take-give)

Hindi Wordnet

Gradation

Intermediate terms
• Pre-Intermediate terms
• Post-Intermediate terms


Eg. सूखा - शुष्क - नम - तर - गीला
10 domains of interpretation. Eg. State, Size, Gender, etc.
Hindi Wordnet - Verbs
Simple Verb - One root. Eg. खाना
Compound Verb - Made up of another POS. Eg. मीठा लगना
Combination Verb - Made of two related verbs. Eg. पढ़ना-लिखना
Onomatopoeic Verb - Eg. खटखटाना from खटखट
Conjunct Verb - Hidden sense of action. Eg. ले जाना

Hindi Wordnet - Verbs

Causative verbs
First causative verb - Eg. सुलाना (to make somebody sleep)
Second causative verb - Eg. सुलवाना (to make somebody sleep through the effort of a third person)

Hindi Wordnet - Creation
Principles for Wordnet creation
Minimality - Minimal set. Eg. {घर, कमरा, कक्ष}
Coverage - Coverage of words. Eg. {घर, कमरा, कक्ष}
Replaceability - Mutual replaceability in a context. Eg. अमेरिका में दो साल बिताने के बाद श्याम स्वदेश/घर लौटा (After spending two years in America, Shyam returned to his homeland/home)
Sanskrit Wordnet
Concept-based Multilingual dictionary
Need
Loss of synonymy when moving across languages. Eg. dark and evil are synonymous in English but their Hindi counterparts अंधेरा and दुष्ट are not.
Number of lexicographers required grows as O(n^2) in the number of languages n

Sanskrit Wordnet - Concept-based Multilingual dictionary
Concepts (Concept ID: Concept description) | L1 (English) | L2 (Hindi) | L3 (Sanskrit)
Concept ID: Concept description | (W1, W2, W3, ..) | (W4, W5, W6, ..) | (W7, W8, W9, ..)
4066: any of various long-tailed primates (excluding the prosimians) | (monkey) | (बंदर, बन्दर, वानर, बानर, कीश, कपि, मर्कट, ..) | (वानरः, कपिः, प्लवङ्गः, प्लवगः, शाखामृगः, वलीमुखः, मर्कटः, ..)
2186: a typical star that is the source of light and heat for the planets in the solar system | (sun) | (सूर्य, सूरज, भानु, दिवाकर, भास्कर, प्रभाकर, दिनकर, रवि, ..) | (सूर्यः, सविता, आदित्यः, मित्रः, अरुणः, भानुः, पूषा, अर्कः, ..)
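Because every language column hangs off the same concept ID, a word can be carried from one language to another through the concept rather than through a bilingual word list. A minimal Java sketch of that lookup follows; the class ConceptDictionary, its methods and the language codes "en", "hi" and "sa" are assumptions made for illustration.

```java
import java.util.*;

/** Toy concept-based dictionary: one row per concept ID, one cell per language. */
final class ConceptDictionary {
    // concept ID -> (language code -> words expressing the concept)
    private final Map<Integer, Map<String, List<String>>> rows = new TreeMap<>();

    void put(int conceptId, String language, String... words) {
        rows.computeIfAbsent(conceptId, k -> new HashMap<>())
            .put(language, Arrays.asList(words));
    }

    /** Target-language words of every concept whose source-language cell contains the word. */
    List<String> translate(String word, String sourceLang, String targetLang) {
        List<String> out = new ArrayList<>();
        for (Map<String, List<String>> row : rows.values()) {
            List<String> source = row.get(sourceLang);
            List<String> target = row.get(targetLang);
            if (source != null && target != null && source.contains(word)) {
                out.addAll(target);
            }
        }
        return out;
    }

    /** A shortened copy of the two rows shown on the slide. */
    static ConceptDictionary sample() {
        ConceptDictionary d = new ConceptDictionary();
        d.put(4066, "en", "monkey");
        d.put(4066, "hi", "बंदर", "कपि");
        d.put(4066, "sa", "वानरः", "कपिः");
        d.put(2186, "en", "sun");
        d.put(2186, "hi", "सूर्य", "सूरज");
        d.put(2186, "sa", "सूर्यः", "सविता");
        return d;
    }

    public static void main(String[] args) {
        ConceptDictionary d = sample();
        System.out.println(d.translate("monkey", "en", "sa"));
        System.out.println(d.translate("सूरज", "hi", "en"));
    }
}
```

With this layout, adding a language means filling one more cell per concept instead of building a new dictionary against every existing language.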
Sanskrit Wordnet Challenges
Observed during construction of Marathi Wordnet:
Single word to synthetic expression. Eg. bankrupt -> दिवाला निकालना
Culture specific concepts. Eg. girlfriend. Requires transliteration such as महिलामित्र
Splitting of concepts. Eg. फ़ीका (tasteless) in Hindi -> अगोड (less sweet), अळणी (less salty), मिळमिळीत (less spicy) in Marathi
Sanskrit Wordnet Challenges
Observed during Indo Wordnet workshop at Coimbatore, June 2009:
Varied usage across regions and people. Eg. In Kashmiri, the Muslim community has separate words for drinking water and water, but the Hindu community has one word.
Single-word and multi-word expressions in same language. Eg. In Nepali, मोह and मोहमाया both mean infatuation.
Sanskrit Wordnet - Sanskrit

Indo-Aryan language
Hinduism
 Buddhism

Classical Sanskrit - Panini
 Vedic Sanskrit - pre-Classical

Sanskrit Wordnet - Sanskrit
Etymology

Etymology of Verbs
गण - Ten classes based on how stem is generated
इट् - Three groups based on position of tense marker
उपसर्ग - 22 prepositional particles that modify a root

Synset Marking
Grouping of synsets based on
frequency of occurrence and usage in
language
 Universal concepts

who and what
 honesty

SynsetMarker - Interface
SynsetMarker - Features
Display of synset fields
Browsing
Search - Word, ID
Marking - Universal, Common, Common in Hindi and Uncommon
Save/Exit
Shortcuts
SynsetMarker - API
records - DefineRecord, SynsetRecord
operations - SynsetOperator, RecordReader, RecordWriter
gui - Interface
SynsetMarker - Process

First round divided among 6 people
31000 synsets marked
Universal and Common (clubbed) - 15234 synsets
Common in Hindi - 6771 synsets
Uncommon - 10987 synsets

Second round voting scheme
Common - 13205 synsets
Core Synset Selection

Bharatiya Vyavahara Kosh
English and 15 Indian languages
 2000 concepts with domains
 खेल (game), प्राणी (animal), फल (fruit)


Link synsets to words in Kosh

Polysemy
• अनन्नास as pineapple fruit
• अनन्नास as pineapple plant
DomainClassifier - Interface
DomainClassifier - Features
Display of synset fields
 Browsing through records
 Marking right synset for a word and a
domain
 Save/Export

DomainClassifier - API
records - DefineRecord, SynsetRecord
operations - SynsetOperator, RecordReader, RecordWriter
gui - Interface
DomainClassifier - Process

Groupings
Single IDs
 Multiple IDs
 No IDs


Rounds of marking
Common synsets
 Common in Hindi synsets
 Uncommon synsets
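The three groupings can be read as a split of Kosh words by how many candidate synset IDs each word matches. That reading is an assumption, as are the class CandidateGrouping, its method names and the sample IDs in the usage lines; a minimal Java sketch under those assumptions:

```java
import java.util.*;

/** Groups words by the number of candidate synset IDs found for them. */
final class CandidateGrouping {
    enum Group { SINGLE_ID, MULTIPLE_IDS, NO_IDS }

    static Group groupOf(List<Integer> candidateIds) {
        if (candidateIds == null || candidateIds.isEmpty()) {
            return Group.NO_IDS;
        }
        return candidateIds.size() == 1 ? Group.SINGLE_ID : Group.MULTIPLE_IDS;
    }

    /** Returns each group with its words in alphabetical order. */
    static Map<Group, List<String>> group(Map<String, List<Integer>> candidatesByWord) {
        Map<Group, List<String>> out = new EnumMap<Group, List<String>>(Group.class);
        for (Group g : Group.values()) {
            out.put(g, new ArrayList<String>());
        }
        Map<String, List<Integer>> sorted = new TreeMap<String, List<Integer>>(candidatesByWord);
        for (Map.Entry<String, List<Integer>> e : sorted.entrySet()) {
            out.get(groupOf(e.getValue())).add(e.getKey());
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> candidates = new HashMap<String, List<Integer>>();
        candidates.put("pineapple", Arrays.asList(101, 102)); // fruit and plant senses, made-up IDs
        candidates.put("game", Arrays.asList(7));
        candidates.put("unlisted", new ArrayList<Integer>());
        System.out.println(group(candidates));
    }
}
```

Words in the multiple-IDs group are the polysemous cases, such as the pineapple fruit and plant senses, where a person has to pick the right synset for the domain.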

DomainClassifier - Process

End of process
Core - 1969 synsets
 Common - 11658 synsets

Online SynsetMarker Interface
Online SynsetMarker - API
Written in PHP
login.php - Interface to log in as a user or as an admin, or to register as a new user
process.php - Processes login/register data and directs the user accordingly
logout.php - Logs out a user
mainprocess.php - Processes data to display an unmarked synset
main.php - Displays a synset with buttons to mark it as Common or Uncommon
admin.php - Admin page with statistics on the number of marked synsets per user and the number of users based on synset marks
adminpassword.php - Password interface to log in as admin
adminuserprofile.php - Profile data of a particular user
Online SynsetMarker Process

Threshold for dropping synset as Uncommon
Had to be set to 1
Common - 10312 synsets
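One way to read the threshold rule: a synset is dropped as Uncommon once the number of users marking it Uncommon reaches the threshold, so a threshold of 1 drops it on the first such mark. This reading, the class name VoteFilter and the sample IDs in the usage lines are assumptions; the sketch is in Java for illustration and is not taken from the PHP code.

```java
import java.util.*;

/** Applies a drop threshold to per-synset counts of "Uncommon" marks. */
final class VoteFilter {
    /**
     * Returns the synset IDs, in ascending order, whose number of
     * "Uncommon" marks is still below the threshold.
     */
    static List<Integer> keepCommon(Map<Integer, Integer> uncommonMarks, int threshold) {
        Map<Integer, Integer> sorted = new TreeMap<Integer, Integer>(uncommonMarks);
        List<Integer> kept = new ArrayList<Integer>();
        for (Map.Entry<Integer, Integer> e : sorted.entrySet()) {
            if (e.getValue() < threshold) {
                kept.add(e.getKey());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Map<Integer, Integer> marks = new HashMap<Integer, Integer>();
        marks.put(121, 0);  // nobody marked it Uncommon
        marks.put(1476, 1); // one user marked it Uncommon
        marks.put(4066, 3);
        System.out.println(keepCommon(marks, 1));
    }
}
```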
Sanskrit Wordnet Interface
Interface for creation of Sanskrit
Wordnet
 Based on idea of Concept-based
Multilingual dictionary

User Interface - Configure
User Interface - Main
User Interface - Panels





Help Panel: Buttons for Commenting, Synchronizing and References tool.
Search Panel: Search word or ID or perform advanced search. Font increase/decrease.
Synset Panels: Synset data fields and completion status.
Tool Panel: English synset, Link tool, Etymology tool.
Browse Panel: Browsing through records, saving and exiting.
User Interface - Features Reference tool
User Interface - Features Synchronize tool
User Interface - Features Advanced Search
User Interface - Features English synsets tool
User Interface - Features Link tool
User Interface - Features Etymology tool
User Interface - Features Keyboard Shortcuts
Undo feature - Monitor keyboard actions and undo on Ctrl-Z
Saving feature - Monitor change in field values and save on Ctrl-S
Search - Ctrl-F for quick search access

Interface API
Problems and Requirements
Huge volumes of data (eg. 30,000
synsets)
 Links between different data
 Efficient and user-friendly GUI
 Sufficient querying

• Grouping
• Review separation
Interface API
Graphical User Interface
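import javax.swing.JButton;

private JButton saveButton = null;

// Create the Save button lazily, on first access, and reuse it afterwards.
public JButton getSaveButton() {
    if (saveButton == null) {
        saveButton = new JButton();
    }
    return saveButton;
}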
Graphical User Interface
Graphical User Interface Panels
Graphical User Interface

Panels - Hierarchical structure
Components (within Panels) - Classes JButton, JTextField, JCheckBox, etc.
Listeners
ActionListener - actions performed by user
KeyListener - key strokes (undo, search) and shortcuts
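The lazy getter and the listener idea combine naturally: the component is built on first access and its ActionListener is attached at the same time. A minimal Java sketch follows; the class SaveButtonSketch and the saveCount field are hypothetical, and a counter stands in for the real save logic.

```java
import java.awt.event.ActionEvent;
import java.awt.event.ActionListener;
import javax.swing.JButton;

/** Lazily built Save button with an ActionListener attached on creation. */
final class SaveButtonSketch {
    private JButton saveButton = null;
    int saveCount = 0; // stands in for the real save logic

    JButton getSaveButton() {
        if (saveButton == null) {
            saveButton = new JButton("Save");
            saveButton.addActionListener(new ActionListener() {
                @Override
                public void actionPerformed(ActionEvent e) {
                    saveCount++; // a real tool would write the record here
                }
            });
        }
        return saveButton;
    }

    public static void main(String[] args) {
        SaveButtonSketch sketch = new SaveButtonSketch();
        sketch.getSaveButton().doClick(); // simulate a user click
        System.out.println("saves: " + sketch.saveCount);
    }
}
```

Because the listener is registered inside the null check, repeated calls to the getter return the same button and never attach a second listener.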
Synset





Synset ID: A unique number identifying a synset
Category: POS category of the words
Concept: The part of the gloss that gives a brief summary of what the synset represents
Example: One or more sentences that use the words in the synset
Synset: The set of synonymous words that make up the synset
Synset - DSF format
ID :: 121
CATEGORY :: NOUN
CONCEPT :: अपने से छोटों के प्रति हृदय में उठनेवाला प्रेम
EXAMPLE :: “चाचा नेहरू को बच्चों से बहुत ही स्नेह था”
SYNSET :: स्नेह,नेह,लगाव,ममता
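A record in this format is a list of "KEY :: value" lines, with the SYNSET value holding comma-separated words. The Java sketch below parses such a record; the class DsfParser, its method names and the rule that a line without "::" continues the previous field are assumptions, not the behaviour of the original RecordReader.

```java
import java.util.*;

/** Parses one record of "KEY :: value" lines into an ordered field map. */
final class DsfParser {
    static Map<String, String> parse(String text) {
        Map<String, String> fields = new LinkedHashMap<String, String>();
        String lastKey = null;
        for (String raw : text.split("\\r?\\n")) {
            String line = raw.trim();
            if (line.isEmpty()) {
                continue;
            }
            int sep = line.indexOf("::");
            if (sep >= 0) {
                lastKey = line.substring(0, sep).trim();
                fields.put(lastKey, line.substring(sep + 2).trim());
            } else if (lastKey != null) {
                // assumed rule: a line without "::" continues the previous value
                fields.put(lastKey, fields.get(lastKey) + " " + line);
            }
        }
        return fields;
    }

    /** Splits the SYNSET field on commas and drops empty entries. */
    static List<String> synsetWords(Map<String, String> fields) {
        List<String> words = new ArrayList<String>();
        String synset = fields.get("SYNSET");
        if (synset == null) {
            return words;
        }
        for (String w : synset.split(",")) {
            if (!w.trim().isEmpty()) {
                words.add(w.trim());
            }
        }
        return words;
    }

    public static void main(String[] args) {
        String record = "ID :: 121\nCATEGORY :: NOUN\nSYNSET :: स्नेह,नेह,लगाव,ममता\n";
        Map<String, String> fields = parse(record);
        System.out.println(fields.get("ID") + " " + synsetWords(fields).size());
    }
}
```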
Data structure SynsetRecord
Class SynsetRecord
Strings to hold field values
 Functions:

equals(otherObject)
 isBetterThan(otherObject)
 isComplete()
…

Data structure DefineRecord
“define-end” language
Example (description of a book about cricket):
define book sixer
length :: 700
topic :: cricket
define chapter 1
length :: 300
topic :: batting
end
define chapter 2
length :: 400
topic :: bowling :: scientific
end
end
Data structure DefineRecord
Example (etymology format):
define etymformat verb
इट् :: dropdown :: word :: सेट्, अनिट्, वेट्
पद :: dropdown :: word :: आत्मनेपद, परस्मैपद, उभयपद
कर्मवत्त्व :: dropdown :: synset :: सकर्मक, अकर्मक
कृत् रूप :: textfield :: word
उपसर्ग :: dropdown :: word :: प्र, परा, अप, सम्, अनु, अव, निस्, निर्, दुस्, दुर्, वि, अधि, अपि, परि, नि, आ, प्रति, उप, सु, उत्, अभि, अति
साधित धातु :: dropdown :: word :: णिच्, सन्, यङ्, यङ्लुक्, नामधातु
end
Data structure DefineRecord
Example (etymology data for synset ID 1476):
define etymology 1476
कर्मवत्त्व :: अकर्मक
finished :: true
define word क्षि
इट् :: सेट्
पद :: परस्मैपद
स्वर ::
कृत् रूप :: क्षयः
उपसर्ग :: अप
साधित धातु ::
end
end
Data structure DefineRecord


Data structure to hold parametric and nested data
Functions:
addField(objectToAdd) - Function to add a parameter or a nested instance of DefineRecord
toString() - Function to export a record in the define-end language
getParameterField(parameterName) - Function to return a specific parameter field
…
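The define-end language nests naturally, so a recursive parser with a matching toString() covers both reading and export. The Java sketch below is an illustration only: the class DefineRecordSketch is a stand-in for DefineRecord, it keeps parameters and nested records in separate collections instead of a single addField(), and it writes parameters before nested records, which matches the examples above but is an assumption about the real format.

```java
import java.util.*;

/** Illustrative stand-in for DefineRecord; not the original class. */
final class DefineRecordSketch {
    final String type; // word after "define", e.g. "book"
    final String name; // rest of the header line, e.g. "sixer"
    final Map<String, List<String>> parameters = new LinkedHashMap<String, List<String>>();
    final List<DefineRecordSketch> children = new ArrayList<DefineRecordSketch>();

    DefineRecordSketch(String type, String name) {
        this.type = type;
        this.name = name;
    }

    List<String> getParameterField(String parameterName) {
        return parameters.get(parameterName);
    }

    /** Parses text that starts with a "define <type> <name>" header line. */
    static DefineRecordSketch parse(String text) {
        return parse(new ArrayDeque<String>(Arrays.asList(text.trim().split("\\r?\\n"))));
    }

    private static DefineRecordSketch parse(Deque<String> lines) {
        String[] head = lines.removeFirst().trim().split("\\s+", 3);
        if (head.length < 2 || !head[0].equals("define")) {
            throw new IllegalArgumentException("expected a 'define' header");
        }
        DefineRecordSketch rec = new DefineRecordSketch(head[1], head.length > 2 ? head[2] : "");
        while (!lines.isEmpty()) {
            String line = lines.peekFirst().trim();
            if (line.startsWith("define ")) {
                rec.children.add(parse(lines)); // nested record consumes its own lines
                continue;
            }
            lines.removeFirst();
            if (line.equals("end")) {
                return rec;
            }
            String[] parts = line.split("\\s*::\\s*");
            if (parts.length == 0 || parts[0].isEmpty()) {
                continue; // blank or malformed line
            }
            List<String> values = new ArrayList<String>(Arrays.asList(parts).subList(1, parts.length));
            rec.parameters.put(parts[0], values);
        }
        throw new IllegalArgumentException("missing 'end' for " + rec.type);
    }

    /** Exports the record, parameters first and nested records after them. */
    @Override
    public String toString() {
        StringBuilder sb = new StringBuilder("define ").append(type);
        if (!name.isEmpty()) {
            sb.append(' ').append(name);
        }
        sb.append('\n');
        for (Map.Entry<String, List<String>> p : parameters.entrySet()) {
            sb.append(p.getKey()).append(" ::");
            if (!p.getValue().isEmpty()) {
                sb.append(' ').append(String.join(" :: ", p.getValue()));
            }
            sb.append('\n');
        }
        for (DefineRecordSketch child : children) {
            sb.append(child.toString());
        }
        return sb.append("end\n").toString();
    }

    public static void main(String[] args) {
        String text = "define book sixer\nlength :: 700\ndefine chapter 1\ntopic :: batting\nend\nend\n";
        DefineRecordSketch book = parse(text);
        System.out.println(book.children.get(0).getParameterField("topic"));
        System.out.print(book);
    }
}
```

A parameter with no value, such as a field left empty in the etymology data, is kept as a key with an empty value list and is written back as "key ::".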
Data Operations
Data Operations - File I/O
Unicode text data manipulation - UTF-8 format
 Classes for file parsing/writing:

RecordWriter
 RecordReader

Data Operations - File I/O

RecordReader
SynsetRecord parser
 DefineRecord parser
 String converters


RecordWriter
SynsetRecord parser
 DefineRecord parser

Data Operations RecordModel Interface
Model to create mechanism for
working with a new data structure
 Handles parsing, writing, querying and
ID retrieval
 Models written as Classes:


SynsetRecordModel
• EnglishSynsetRecordModel

AbstractDefineRecordModel
Data Operations RecordModel Interface







int getRecordId(E record): Function to return the record ID of a record
boolean isBetterThan(E a, E b): Function to return whether record a weighs better than record b
boolean isFinished(E a): Function to return whether a record can be set as completed
E mergeRecords(E a, E b): Function to merge the data in two separate records into one
boolean searchWord(String word, E a): Function to perform a query (defined in String word) on a record
E parseRecord(RecordReader fileHandle): Function to parse a record from a file
void writeRecord(RecordReader fileHandle, E a): Function to write a record into a file
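The listed functions map onto a generic Java interface with one implementation per data structure. The sketch below is a reduced illustration: it omits parseRecord and writeRecord because RecordReader is not shown here, and WordRecord, WordRecordModel and the "more words is better" rule are hypothetical choices, not the original models.

```java
import java.util.*;

/** Reduced sketch of the RecordModel idea; the file parse/write methods are omitted. */
interface RecordModelSketch<E> {
    int getRecordId(E record);
    boolean isBetterThan(E a, E b);
    boolean isFinished(E a);
    E mergeRecords(E a, E b);
    boolean searchWord(String word, E a);
}

/** Hypothetical record type: an ID plus the words of a synset. */
final class WordRecord {
    final int id;
    final List<String> words;

    WordRecord(int id, List<String> words) {
        this.id = id;
        this.words = Collections.unmodifiableList(new ArrayList<String>(words));
    }
}

/** Model for WordRecord; the weighting and completion rules are assumptions. */
final class WordRecordModel implements RecordModelSketch<WordRecord> {
    public int getRecordId(WordRecord record) {
        return record.id;
    }

    public boolean isBetterThan(WordRecord a, WordRecord b) {
        return a.words.size() > b.words.size(); // assumed: more words weighs better
    }

    public boolean isFinished(WordRecord a) {
        return !a.words.isEmpty(); // assumed: any word makes the record complete
    }

    /** Keeps the ID of a and the union of both word lists, without duplicates. */
    public WordRecord mergeRecords(WordRecord a, WordRecord b) {
        Set<String> union = new LinkedHashSet<String>(a.words);
        union.addAll(b.words);
        return new WordRecord(a.id, new ArrayList<String>(union));
    }

    public boolean searchWord(String word, WordRecord a) {
        return a.words.contains(word);
    }

    public static void main(String[] args) {
        WordRecordModel model = new WordRecordModel();
        WordRecord a = new WordRecord(121, Arrays.asList("sneha", "neha"));
        WordRecord b = new WordRecord(121, Arrays.asList("neha", "lagaav"));
        System.out.println(model.mergeRecords(a, b).words);
    }
}
```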
Data Operations RecordOperator Class
Operator to provide functionality to
work with records of data
 Load, Browse, Update, Search,
Synchronize and Write
 Two kinds at the GUI level:

Parent Operator
 Linker Operator

Data Operations RecordOperator Class
Functions for each data type (depending on the corresponding RecordModel):
Constructors for ParentOperator and LinkerOperator
getRecord() - Function to obtain the current record
setCurrentId() and getCurrentId() - Functions to set and obtain ID to work with
getFirstId(), getPreviousId(), getNextId() and getLastId() - Functions to browse through records
isFinished() and isAllFinished() - Functions to obtain completion status of records
searchRecords() and advancedSearch() - Functions to perform search operations on the records
…
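The browsing functions amount to moving a cursor over records ordered by ID. A minimal Java sketch using a sorted map follows; the class BrowseOperatorSketch, the put() method and the choice to return null at either end are assumptions and do not describe the original RecordOperator.

```java
import java.util.*;

/** Cursor over records ordered by ID; an illustration of the browsing calls only. */
final class BrowseOperatorSketch<E> {
    private final NavigableMap<Integer, E> records = new TreeMap<Integer, E>();
    private Integer currentId = null;

    /** Stores a record; the first one stored becomes the current record. */
    void put(int id, E record) {
        records.put(id, record);
        if (currentId == null) {
            currentId = id;
        }
    }

    void setCurrentId(int id) {
        if (!records.containsKey(id)) {
            throw new NoSuchElementException("no record with ID " + id);
        }
        currentId = id;
    }

    Integer getCurrentId() {
        return currentId;
    }

    E getRecord() {
        return currentId == null ? null : records.get(currentId);
    }

    Integer getFirstId() {
        return records.isEmpty() ? null : records.firstKey();
    }

    Integer getLastId() {
        return records.isEmpty() ? null : records.lastKey();
    }

    /** Next higher ID, or null when the current record is the last one. */
    Integer getNextId() {
        return currentId == null ? null : records.higherKey(currentId);
    }

    /** Next lower ID, or null when the current record is the first one. */
    Integer getPreviousId() {
        return currentId == null ? null : records.lowerKey(currentId);
    }

    public static void main(String[] args) {
        BrowseOperatorSketch<String> op = new BrowseOperatorSketch<String>();
        op.put(1476, "etymology data");
        op.put(121, "synset data");
        op.setCurrentId(121);
        System.out.println(op.getNextId() + " " + op.getPreviousId());
    }
}
```

A sorted map keeps browsing correct even when IDs are sparse or records are loaded out of order.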
API Overview
GUI defines one ParentOperator (eg.
source synsets)
 GUI defines many LinkerOperators
(eg. target synsets, link data, etc.)
 Models attached to the operators
 Data repositories are defined
 GUI browses, retrieves and
manipulates data using operators.

Version history
Future work
Tool to generate etymology format
 GUI functionality to display synsets
from multiple languages
 Advanced commenting based on
reviews and completion

Thank you