Konkani Wordnet Development


INDRADHANUSH WORDNET DEVELOPMENT FOR PUNJABI LANGUAGE
Dr. Suman Preet
Department of Linguistics and Punjabi Lexicography,
Punjabi University, Patiala
NATURE OF TASK
• Synset Creation for Nouns, Adjectives, Verbs and Adverbs
• Creation of Language Specific Synsets
• Sense Marking
• Validation
• New Synset Creation for Hindi WordNet
GOALS SET IN THE LAST PRSG
• To complete the linking of 36,534 synsets.
• Validation of 36,534 synsets.
• To create 1000 LSS.
• Creation and maintenance of Individual WordNet Group Websites.
• To complete sense marking on 1,00,000 words.
Presentation Outline
• Financial Details
• Sense Marking Details
• Synset Creation Details
• Validation Details
• Problems and Suggestions
Financial Details
Total grant sanctioned    Rs 22,14,000/-
Total grant released      Rs 20,23,974/-
1st year (released)       Rs 11,44,000/-
2nd year (released)       Rs 8,79,974/-
Recently released         Rs 1,86,833/-
Headwise Break-up of Expenditure
Table 1: HEADWISE BREAK-UP OF EXPENDITURE (in Rupees), Punjabi University

Column key:
A  Total Approved Budget Outlay for 1st and 2nd year
B  Total Amount Released
C  Total Amount Spent as on 30th November 2012
D  Actual Expenditure from 1 December 2012 to 31st January 2013
E  Total Amount Spent as on 1st February 2013
F  Requested Revised Amount (RRA)
G  Change in Approved Amount (CiA)
H  Balance of Released Amt. as on 1st February 2013
I  Last Amt. Received from Funding Agency (in Feb, 2013)
J  Amt. as on 1 March, 2013
K  Actual Expenditure from 1 Feb 2013 to 31st March 2013
L  Balance as on 1 April, 2013

Columns A to G:
Sr. No.  Head                          A        B        C        D        E        F  G
1        Capital Equipment        165000   157272   157272        0   157272   165000  0
2        Consumable Stores        100000   100000    86138    13366    99504   100000  0
3        Manpower (Vertical)     1260000  1088870   988676   110000  1098676  1260000  0
4        Manpower (Horizontal)         0        0        0        0        0        0  0
5        Travel (Vertical)        200000   188832    96453    17287   113740   200000  0
6        Travel (Horizontal)           0        0        0        0        0        0  0
7        Workshop & Training           0        0        0        0        0        0  0
8        Contingencies            200000   200000   200000        0   200000   200000  0
9        Co-ordination                 0        0        0        0        0        0  0
10       Overheads                289000   289000   289000        0   289000   289000  0
         Total                   2214000  2023974  1817539   140653  1958192  2214000  0

Columns H to L:
Sr. No.  Head                         H       I       J       K       L
1        Capital Equipment            0    7728    7728    6950     778
2        Consumable Stores          496       0     496     496       0
3        Manpower (Vertical)      -9806  171130  161324  110000   51324
4        Manpower (Horizontal)        0       0       0       0       0
5        Travel (Vertical)        75092    7975   83067       0   83067
6        Travel (Horizontal)          0       0       0       0       0
7        Workshop & Training          0       0       0       0       0
8        Contingencies                0       0       0       0       0
9        Co-ordination                0       0       0       0       0
10       Overheads                    0       0       0       0       0
         Total                    65782  186833  252615  117446  135169
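The figures in Table 1 can be cross-checked mechanically. The sketch below is only an illustration: the relations E = C + D, H = B - E, J = H + I and L = J - K are assumptions inferred from the column labels, not rules stated in the report, and the function names are hypothetical.

```python
# Rows copied from Table 1: head -> (A, B, C, D, E, F, G, H, I, J, K, L)
ROWS = {
    "Capital Equipment":     (165000, 157272, 157272, 0, 157272, 165000, 0, 0, 7728, 7728, 6950, 778),
    "Consumable Stores":     (100000, 100000, 86138, 13366, 99504, 100000, 0, 496, 0, 496, 496, 0),
    "Manpower (Vertical)":   (1260000, 1088870, 988676, 110000, 1098676, 1260000, 0, -9806, 171130, 161324, 110000, 51324),
    "Manpower (Horizontal)": (0,) * 12,
    "Travel (Vertical)":     (200000, 188832, 96453, 17287, 113740, 200000, 0, 75092, 7975, 83067, 0, 83067),
    "Travel (Horizontal)":   (0,) * 12,
    "Workshop & Training":   (0,) * 12,
    "Contingencies":         (200000, 200000, 200000, 0, 200000, 200000, 0, 0, 0, 0, 0, 0),
    "Co-ordination":         (0,) * 12,
    "Overheads":             (289000, 289000, 289000, 0, 289000, 289000, 0, 0, 0, 0, 0, 0),
}
TOTAL = (2214000, 2023974, 1817539, 140653, 1958192, 2214000, 0, 65782, 186833, 252615, 117446, 135169)


def check_row(values):
    """Return the names of the assumed relations that a row violates."""
    a, b, c, d, e, f, g, h, i, j, k, l = values
    relations = (
        ("E = C + D", e == c + d),
        ("H = B - E", h == b - e),
        ("J = H + I", j == h + i),
        ("L = J - K", l == j - k),
    )
    return [name for name, ok in relations if not ok]


def check_table(rows, total):
    """Check every row, then compare column sums with the Total row."""
    problems = {}
    for head, values in rows.items():
        failed = check_row(values)
        if failed:
            problems[head] = failed
    column_sums = tuple(sum(column) for column in zip(*rows.values()))
    if column_sums != total:
        problems["Total"] = ["column sums differ from the Total row"]
    return problems


problems = check_table(ROWS, TOTAL)
print(problems or "all rows consistent")
```

A check of this kind is mainly useful when the sheet is updated for the next reporting period, since a mistyped cell shows up as a named relation that no longer holds.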
SENSE MARKING DETAILS
Target: 1,00,000 words
Division of Target between Punjabi University and Thapar University

             Punjabi University   Thapar University
Target       60,000               40,000
Complete     60,182               33,097
Remaining    0                    6,903

The sense marking task was divided into two parts by mutual understanding, as shown above. The Punjabi University WordNet Group has achieved its target.
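The "Remaining" figures follow from the target and completed counts. A minimal sketch, assuming that work done beyond a group's target is not counted as negative remaining work (the helper name is hypothetical):

```python
# (target, completed) word counts per group, taken from the table above
GROUPS = {
    "Punjabi University": (60_000, 60_182),
    "Thapar University": (40_000, 33_097),
}


def remaining_words(target, completed):
    """Words still to be sense marked; never below zero."""
    return max(target - completed, 0)


for name, (target, completed) in GROUPS.items():
    print(f"{name}: {remaining_words(target, completed)} words remaining")

# Combined progress against the overall target of 1,00,000 words
total_completed = sum(completed for _, completed in GROUPS.values())
print(f"Total sense marked: {total_completed}")
```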
Sense Marking Status
Sr. No.  Details                    Punjabi University  Thapar University  Total
1        No. of Files Used          45                  53                 98
2        Total Words                1,38,735            78,143             2,16,878
3        Total Words Sense Marked   60,182              33,097             93,279
4        Accuracy                   43.11%              42.35%             43%
5        Target                     Complete            Incomplete
RECORD OF SENSE MARKING WORK BY PUNJABI UNIVERSITY
Type of Corpus      No. of Files  No. of Sentences  Total Words  Sense Marked Words  Accuracy
News and Articles   45            6716              1,38,735     60,182              43.11
Actions taken during Sense Marking
Action One  Action Two  Action Three  Action Four  Total Actions
1102        30          388           269          1789
Words added to the Punjabi Synset File by actions one and two = 1132
STATUS OF SYNSET COMPLETED TILL 28 APRIL 2013
Sr. No.  File Name   Total Synsets  Completed by Pbi. Uni.  Completed by TU          Complete Synsets  Remaining Synsets
1.       Universal   7168           4084                    3084                     7168              0
2.       Pan Indian  1347           674                     673                      1347              0
3.       Verb        1798           807                     991                      1798              0
4.       Adverb      209            105                     104                      209               0
5.       Adjective   3605           1802                    1803                     3605              0
6.       Noun        22050          11026 (complete)        5836/11024 (incomplete)  16862             5188 (TU's task)

The synset creation task was divided into two parts by mutual understanding, as shown above.
The Punjabi University Group has completed its synset creation task.
POS CATEGORY SYNSETS COMPLETED
Category    Total Synsets
Noun        19598
Verb        2836
Adjective   5828
Adverb      443
Total       28705
INTERNAL VALIDATION DETAILS
The validation task is being done by the Punjabi University WordNet Group.

Sr. No.  File Name        No. of Synsets  Validated Synsets  Words Added  Words Deleted
1.       Universal File   7168            7168               1287         80
2.       Adverb File      209             209                43           15
3.       Adjective File   3605            3605               340          115
4.       Pan Indian File  1347            450                253          103
5.       Verb File        1798            0                  -            -
6.       Noun File        22050           0                  -            -
         Total            36177           11432              1923         313
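Validation progress per file can be derived from the counts above. A small sketch, using only the synset and validated-synset figures from the table; the percentage helper is a hypothetical convenience, not part of the project tooling:

```python
# file name -> (number of synsets, validated synsets), from the validation table
FILES = {
    "Universal File": (7168, 7168),
    "Adverb File": (209, 209),
    "Adjective File": (3605, 3605),
    "Pan Indian File": (1347, 450),
    "Verb File": (1798, 0),
    "Noun File": (22050, 0),
}


def percent_validated(total, validated):
    """Share of synsets validated, as a percentage (0.0 for an empty file)."""
    return 100.0 * validated / total if total else 0.0


total_synsets = sum(total for total, _ in FILES.values())
total_validated = sum(validated for _, validated in FILES.values())

for name, (total, validated) in FILES.items():
    share = percent_validated(total, validated)
    print(f"{name}: {share:.1f}% validated, {total - validated} remaining")

overall = percent_validated(total_synsets, total_validated)
print(f"Overall: {total_validated} of {total_synsets} ({overall:.1f}%)")
```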
PUNJABI LANGUAGE SPECIFIC SYNSETS
Total: 1010
• Noun: 961
• Adjective: 16
• Verb: 33
NEW SYNSET CREATION FOR HINDI WORDNET
New common synsets created by Punjabi University which were not present in Hindi WordNet (Total 50)
भाजपाई, अकाली, बाबा बंदा सिंह बहादुर, मध्धर, सुलग्ग, दुआनी, कणकवंना, भागवान, मलटी ब्रांड, फिरकू, समाजवादी पार्टी, तृणमूल कांग्रेस, पी.जी.आई.ऐम.ई.आर., नुक्कड़ नाटक, हफ़ीजाबाद, ब्यूटी पार्लर, लालपरी, मैक, मार्क्सवाद, मार्क्सवादी, बरनाला जिला, बरनाला शहर, असंवेदनशील, कॉल सेंटर, पीजा, राष्ट्रीय सुरक्षा परिषद, पीली नदी, बेसबॉल, डिस्पेंसरी, स्टॉकटन शहर, परमवीर चक्र, महावीर चक्र, कीर्ति चक्र, शौर्य चक्र, फास्ट फूड, स्ट्रीट फूड, जंक फूड, गुरू नानक देव युनिवर्सिटी, पद्म श्री, डिप्टी इंस्पैक्टर जनरल आफ पुलिस, डिप्टी जनरल आफ पुलिस, सब-डिवीजनल अफसर, लहिंदा पंजाब, पूर्बी पंजाब, किला लोहगढ़, अटारी, बादल, ग़दर पार्टी, ग़दर अख़बार, कर्तार सिंह सराभा
These words were taken from Punjabi online newspapers such as Daily Ajit, Punjabi Tribune and Charhdikala.
PROBLEMS OCCURRING IN SENSE MARKING
• Problems related to English words
• Problems related to compound words
• Problems related to adjectives
• Problems related to proverbs
• Problems related to verbs
Problems Related to English Words
• Borrowed or Accepted English Words
• Comparative alternative present in the WordNet
• Not found in WordNet
• Proper sense not present
• Abbreviations
FIVE TYPES OF ENGLISH WORDS IN CORPORA

1. Accepted English Words
   Words: pen, bus, car, computer, cycle, etc.
   Problems: No problem in sense marking.

2. Comparative alternatives present in the WordNet
   Words: history, road, city, book, popular, portal, school
   Problems: How should these be tagged? Note: if we add these words as synonyms, there will be thousands of such words in use.
   Suggestions: English words should be selected according to their frequency of usage in Indian languages.

3. Not found in WordNet
   Words: networking, Olympian, Nokia, Samsung
   Problems: No tag
   Suggestions: Creation of new synsets

4. Proper sense not present in WordNet
   Words: tower (mobile tower), call (phone call), server (web server), depression (psychological), interview
   Problems: No tag
   Suggestions: Creation of new synsets with proper sense

5. Abbreviations (only the abbreviation in use; full and short forms both in use)
   Words: VAT, MRF, HMV, PPSC, COAI, HIV, NABARD, PSEB, CBSE, DIG, IG, BA, BBA, BCA, IIT, DU, PU, PTU, BBC, etc.
   Problems: No tag
   Suggestions: Creation of new synsets with short and full forms
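The suggestion to select English words by frequency of usage could be approached roughly as follows. This is a sketch under stated assumptions: the token list, the candidate set and the `min_count` threshold are all hypothetical, and a real corpus would need its own tokenisation.

```python
from collections import Counter


def frequent_borrowings(tokens, candidates, min_count):
    """Return (word, count) pairs for candidate borrowed words that occur
    at least `min_count` times, most frequent first."""
    counts = Counter(tokens)
    selected = [(word, counts[word]) for word in candidates if counts[word] >= min_count]
    # Sort by descending count, then alphabetically for a stable order
    return sorted(selected, key=lambda pair: (-pair[1], pair[0]))


# Hypothetical mini-corpus of borrowed words as they appear in running text
sample_tokens = ["bus", "bus", "road", "school", "bus", "road"]
sample_candidates = {"bus", "road", "school", "portal"}

selected_words = frequent_borrowings(sample_tokens, sample_candidates, min_count=2)
print(selected_words)
```

Only the words that pass the threshold would then be considered for addition as synonyms, which keeps the list from growing to every English word that appears once.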
PROBLEMS RELATED TO COMPOUND WORDS
Most of the common compound words do not exist in the WordNet. If we mark the parts of these compounds separately, the sense they actually convey is lost. For example:
ਧੁੱਪ-ਛਾਂ, ਜੋੜ-ਤੋੜ, ਚੁਸਤ-ਦਰੁਸਤ, ਗੰਢ-ਤੁਪ, ਤੁੱਥ-ਮੁਥ, ਮੁੰਡੇ-ਕੁੜੀਆਂ, ਚੰਗੇ-ਭਲੇ, ਮੈਲੇ-ਕੁਚੈਲੇ, ਚੰਗੇ-ਭਲੇ, ਢੰਗ-ਤਰੀਕੇ, ਪੂਰਾ-ਪੂਰਾ, ਸ਼੍ਰੇਣੀ-ਵੰਡ, ਮਾਣ-ਸਤਿਕਾਰ, ਛੋਟੇ-ਛੋਟੇ, ਸੋਚੇ-ਸਮਝੇ, ਕੱਚ-ਸੱਚ, ਅੱਖੋਂ-ਪਰੋਖੇ, ਸੱਚੇ-ਸੁੱਚੇ, ਬਾਗੋਬਾਗ, ਰੋਕ-ਟੋਕ, ਬੁਰਾ-ਭਲਾ, ਦੂਰ-ਨੇੜੇ, ਦਿਨ-ਰਾਤ, ਪੁੱਛਣ-ਦੱਸਣ, ਕਹਿਣ-ਸੁਣਨ, ਹਾਂ-ਨਾ
Hindi translation (with translation-tool assistance):
धूप-छाया, जोड़-तोड़, दिन-पदिन, चुस्त-दुरुस्त, गाँठ-तुप, तुथ्थ-मुथ, लड़के-लड़कियाँ, अच्छे-भले, मैले-कुचैले, अच्छा-भला, ढंग-तरीके, पूरा-पूरा, श्रेणी-विभाजन, गर्व-सत्कार, छोटे-छोटे, सोचे-समझे, काँच-सत्य, आँखों-परोखा, सत्य-सुच्चा, बागो-बाग, विघ्न-टोक, बुरा-भला, दूर-निकट(में), दिन-रात, पूछना-बताने, कहने-सुनना, हाँ-न
PROBLEMS RELATED TO ADJECTIVE
Feminine Gender:
Feminine forms of adjectives are not included in the WordNet, but they occur frequently in the text and reference materials.
Some examples:
ਸੋਹਣੀ, ਲੰਬੀ, ਛੋਟੀ, ਫ਼ੁਰਤੀਲੀ, ਸਾਂਝੀ, ਸੁਨਹਿਰੀ, ਸੌਖੀ, ਮੋਟੀ, ਨਿੱਕੀ, ਉੱਚੀ
सोहनी, लंबी, छोटी, फुरतीली, सांझी, सुनहरी, सौखी, मोटी, निक्की, उच्ची
PROBLEMS RELATED TO PROVERBS
Proverbs are not included in the Hindi WordNet. We are marking them word by word. How can we mark them?
1. ਨੀਮ ਹਕੀਮ ਖਤਰਾ ਏ ਜਾਨ
2. ਕੱਲਾ ਇਕ ਦੋ ਗਿਆਰਾਂ
3. ਕਦਮ ਦਾ ਖੁੰਝਿਆ ਕੋਹਾਂ ਤੇ ਪੈਂਦੈ
Transliteration
1. नीम हकीम खतरा ए जान
2. कल्ला इक दो गिआरां
3. कदम दा खुंझिआ कोहां ते पैंदै
English Translation
1. A little knowledge is a dangerous thing
2. Two heads are better than one
3. A miss by an inch is a miss by a mile
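One possible way to keep the sense of compound words and proverbs is to look for the longest multiword entry first and fall back to word-by-word marking only when none matches. The sketch below illustrates that idea; it is not how the sense marking tool works. The `MULTIWORD_ENTRIES` set, the sample sentence and the `max_len` limit are hypothetical, and such entries would first have to be created as synsets.

```python
def group_units(tokens, multiword_entries, max_len=6):
    """Greedy longest-match grouping: at each position, prefer the longest
    run of tokens that is a known multiword entry; otherwise keep one token."""
    units = []
    i = 0
    while i < len(tokens):
        match_len = 1
        # Try the longest candidate span first, down to two tokens
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            if tuple(tokens[i:i + n]) in multiword_entries:
                match_len = n
                break
        units.append(" ".join(tokens[i:i + match_len]))
        i += match_len
    return units


# Hypothetical multiword entries: one compound and one proverb from the examples
MULTIWORD_ENTRIES = {
    ("ਦਿਨ", "ਰਾਤ"),
    ("ਨੀਮ", "ਹਕੀਮ", "ਖਤਰਾ", "ਏ", "ਜਾਨ"),
}

# Illustrative sentence (not from the corpus), already split into tokens
sample_tokens = ["ਉਹ", "ਦਿਨ", "ਰਾਤ", "ਪੜ੍ਹਦਾ"]
sample_units = group_units(sample_tokens, MULTIWORD_ENTRIES)
print(sample_units)
```

With this grouping, the compound is presented to the annotator as one unit to be marked, while the surrounding words are still marked individually.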
FOLLOWING FIELDS OF WORDS ADDED IN SYNSET FILE DURING SENSE MARKING TASK
Sports: ਚੈਂਪੀਅਨ, ਕਿਟ, ਫਾਈਨਲ, ਸੀਰੀਜ਼, ਵਿਸ਼ਵ ਕੱਪ, ਕਮੈਂਟੇਟਰ, ਬੇਸਬਾਲ, ਏਸ਼ੀਅਨ ਖੇਡ, ਏਸ਼ੀਆ ਕੱਪ, ਚੈਂਪੀਅਨ ਟਰਾਫੀ, ਜਾਫੀ, ਧਾਵੀ, ਰੇਡ, ਕਾਮਨਵੈਲਥ ਖੇਡ
Business: ਮਲਟੀ ਬਰਾਂਡ, ਸਰਵਿਸ ਟੈਕਸ, ਟੈਂਡਰ, ਪੈਕੇਜ, ਵੈਟ, ਕਰੰਸੀ, ਸਵੈ-ਰੁਜ਼ਗਾਰ, ਆਊਟ-ਸੋਰਸਿੰਗ
Politics: ਅਕਾਲੀ ਦਲ, ਕੁਰਸੀ, ਭਾਜਪਾਈ, ਕੈਬਨਿਟ, ਤ੍ਰਿਣਮੂਲ ਕਾਂਗਰਸ, ਜਨਤਾ ਦਲ, ਹਾਈਕਮਾਨ, ਸੰਵਿਧਾਨਕ, ਐਮ.ਐਲ.ਏ., ਅਸੈਂਬਲੀ
SUGGESTIONS-I
• There should be a separate button on the IndoWordNet website for the common vocabulary (words that have the same sense in all languages) of all the languages.
• There should be a separate button on the IndoWordNet website for the word frequency list of each language.
• There should be a separate button on the IndoWordNet website for the borrowed word list of each language.
• There should be a separate button on the IndoWordNet website for the names of great personalities of all the languages.
SUGGESTIONS-II
We should prepare some parameters for entries of:
• Places
• Institutions
• Famous personalities
• Famous creations: books, films, paintings, music, etc.
• Famous incidents and dates
• Scientific vocabulary
• Words from other special fields, etc.
These parameters would help us in creating new synsets and Language Specific Synsets (LSS).
TEAM COMPOSITION
P.I. details
• Dr. Suman Preet, Associate Professor & Head, Dept. of Linguistics and Pbi. Lexicography, Punjabi University, Patiala.
Co-P.I. details
• Dr. Harjeet Gill, Professor Eminence, Pbi. Uni., and Prof. Emeritus, JNU.
DETAILS OF THE MANPOWER ASSOCIATED WITH THE PROJECT
Staff details
Miss Balwinder Kaur, M.A. (Pbi.), PhD (in cont.)
Designation: Senior Linguist
Work Details: Linking synsets, Validating synsets,
Creating & monitoring Language Specific Synsets
Salary : 22,000/- p.m.
Mr. Satpal Singh, M.A. (Eng, Linguistics), Diploma in Persian, B.Ed.
Designation: Lexicographer
Work Details : Linking synsets, Validating synsets, Sense Marking
Salary : 16,500/- p.m.
Mr. Vinay Hasija, B. Tech. (Computer Engg)
Designation: Lexicographer
Work Details: Validating synsets, Website creation, Sense
Marking
Salary: 16,500/- p.m.