WordNet, EuroWordNet, Balkanet

Faculty of Informatics MU
Karel Pala
Overview
• Starting points
• What is WordNet?
• EuroWordNet 1, 2
• Balkanet
• Czech WordNet
• Tools
Starting points I
• G. A. Miller – founder of psycholexicology,
• Model of human lexical memory, associations
• Hierarchical organization of nouns in human memory
• A canary can sing. vs. A canary can fly. vs. A canary has skin.
• Canary – can sing – verification time t1 (answer: true)
• Bird – can fly – time t2 (answer: true)
• Animal – has skin – time t3, where t1 < t2 < t3
• Generic information is not stored redundantly
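The non-redundant storage idea above can be illustrated with a minimal sketch. The dictionaries and the function name `verify` are assumptions made for this example, not part of Miller's model; the number of levels climbed is only a stand-in for the ordering t1 < t2 < t3.

```python
# Toy hierarchical lexical memory (illustrative data only).
# Each concept stores just its own distinguishing property
# plus a link to its superordinate concept.
PARENT = {"canary": "bird", "bird": "animal", "animal": None}
PROPERTIES = {
    "canary": {"can sing"},
    "bird": {"can fly"},
    "animal": {"has skin"},
}

def verify(concept, prop):
    """Return (found, levels_climbed) for a statement such as 'a canary can fly'."""
    levels = 0
    node = concept
    while node is not None:
        if prop in PROPERTIES[node]:
            return True, levels
        node = PARENT[node]  # generic information lives higher up
        levels += 1
    return False, levels

for prop in ("can sing", "can fly", "has skin"):
    print(prop, verify("canary", prop))
# can sing (True, 0)
# can fly (True, 1)
# has skin (True, 2)
```

The property "has skin" is stored once, at "animal", and is reached from "canary" only after climbing two levels.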
Starting points II
• Humans easily process anaphoric expressions
• He has a rifle, but this weapon has never been used.
• Alphabetic vs. hierarchical ordering of dictionary entries,
capturing semantic relations
• Genus proximum (hypero/hyponyms), siblings (tree vs.
pine, oak, beech, fir, spruce, lime tree)
• Machine-readable dictionaries – problems with data
organization: alphabetic ordering separates pieces of
information that naturally belong together: dog, coyote,
hyena
• Lexical databases (WordNet), thesauri (Roget) vs.
classical or standard dictionaries
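A small sketch of the contrast between alphabetic and hierarchical ordering. The word list and the hypernym labels ("carnivore", "plant", "artifact") are toy data assumed for this example and are not taken from any real dictionary or from WordNet.

```python
from collections import defaultdict

# Toy lexicon: word -> assumed hypernym label (illustrative only).
HYPERNYM = {
    "dog": "carnivore", "coyote": "carnivore", "hyena": "carnivore",
    "cabbage": "plant", "daisy": "plant",
    "door": "artifact", "hammer": "artifact",
}

# Alphabetic ordering, as in a classical dictionary.
alphabetic = sorted(HYPERNYM)

# Hierarchical ordering: entries grouped under their hypernym.
by_hypernym = defaultdict(list)
for word in alphabetic:
    by_hypernym[HYPERNYM[word]].append(word)

print(alphabetic)
# ['cabbage', 'coyote', 'daisy', 'dog', 'door', 'hammer', 'hyena']
print(by_hypernym["carnivore"])
# ['coyote', 'dog', 'hyena']
```

In the alphabetic list the three related animals are interleaved with unrelated entries; grouping by hypernym puts them next to each other.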
Princeton WordNet 1.5-2.0
• Net of words, English as first language: WN v. 1.5, 1.7,
2.0, approx. 100 thousand synsets
• Nouns (60 thous.), verbs (11 thous.), adjectives, adverbs,
function expressions (synsemantic expressions)
• Synsets: [(List of synonyms), (POS), (Gloss), (Semantic
Relations), ID], e.g. {driver:1 (n), the operator of a motor
vehicle, H/H, ID:ENG20-09277009-n}
• Semantic relations between synsets: synonymy,
hypero/hyponymy (H/H), antonymy, holo/meronymy (lexical
system with inheritance), large network
• Top hyperonyms: 25; later Top Ontology: 63; Base Concepts (BCs): 1053
• Nodes in H/H trees can be understood as semantic
features; up to 13 levels for nouns, about 6 for verbs
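The synset record described above can be sketched as a small data structure. The class name `Synset`, the helper `hypernym_chain`, and the two placeholder synsets with IDs starting with "X-" are assumptions for illustration; only the "driver" entry and its ID come from the slide.

```python
from dataclasses import dataclass, field

@dataclass
class Synset:
    id: str
    literals: list   # (word, sense number) pairs
    pos: str
    gloss: str
    relations: dict = field(default_factory=dict)  # relation name -> list of synset IDs

# The "driver" synset from the slide; the hypernym link is a placeholder.
driver = Synset(
    id="ENG20-09277009-n",
    literals=[("driver", 1)],
    pos="n",
    gloss="the operator of a motor vehicle",
    relations={"hypernym": ["X-operator-n"]},
)
# Hypothetical placeholder synsets (IDs and empty glosses are made up).
operator = Synset("X-operator-n", [("operator", 1)], "n", "", {"hypernym": ["X-person-n"]})
person = Synset("X-person-n", [("person", 1)], "n", "", {})

INDEX = {s.id: s for s in (driver, operator, person)}

def hypernym_chain(synset_id):
    """Follow hypernym links upward and return the first literal of each ancestor."""
    chain = []
    current = INDEX[synset_id]
    while current.relations.get("hypernym"):
        current = INDEX[current.relations["hypernym"][0]]
        chain.append(current.literals[0][0])
    return chain

print(hypernym_chain("ENG20-09277009-n"))  # ['operator', 'person']
```

The length of the returned chain corresponds to the depth of the node in its H/H tree.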
Princeton WordNet II
• PWN has been developed by G.A. Miller and his group
in Princeton
• It is free, exists for all platforms, and can be
downloaded at the address:
clarity.princeton.edu
• A simple browser that can export selected data for
further processing can be downloaded as well
• Standard database format; it can now be converted
into XML format and used with VisDic (www pages FI MU)
• PWN as such is not based on any corpus data; sense
discrimination was done introspectively, which influences
it negatively.
EuroWordNet 1 and 2
• New features in comparison with PWN:
- multilinguality: 8 languages (En, It, Du, Sp, Ge,
Fr, Esto, Cz) plus the Interlingual Index (ILI)
- Top Ontology (TO, 63 beginners) and the set of Base
Concepts (1053)
- internal language relations (ILR),
semantic roles Ag, Pat, Instr, Loc, …, with synsets
- browser and editor: Polaris 1.5 (licensed);
free: browser Periscope, ELDA/ELRA CD
- example: Top Ontology scheme
EuroWordNet II
• Building the individual WordNets using the set of Base
Concepts (BCs)
• Translation equivalents and lexical gaps – Interlingual
Index (ILI)
• Problems with too fine-grained sense discrimination in
PWN:
- e.g. the verb to get has about 35 senses in PWN
- in NODE (New Oxford Dictionary of English) only 8 basic ones (+ subsenses)
• Problems with typological differences between
languages: verb aspect, diminutives, prefixation, virtual
(empty) nodes
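How translation equivalents and lexical gaps fall out of a shared index can be sketched as follows. The record IDs ("ILI-1", "ILI-2"), the tiny per-language tables, and the function `translate` are assumptions for illustration, and the gap shown is toy data rather than a statement about the actual Czech WordNet.

```python
# ILI record id -> literals, per language (toy data, IDs are made up).
WORDNETS = {
    "en": {"ILI-1": ["dog"], "ILI-2": ["toe"]},
    "cs": {"ILI-1": ["pes"]},      # nothing linked to ILI-2: a lexical gap
    "es": {"ILI-1": ["perro"]},
}

def translate(word, source, target):
    """Return target-language literals that share an ILI record with `word`,
    or None when the target wordnet has no synset for any such record."""
    results = []
    for ili, literals in WORDNETS[source].items():
        if word in literals:
            results.extend(WORDNETS[target].get(ili, []))
    return results or None

print(translate("dog", "en", "cs"))   # ['pes']
print(translate("pes", "cs", "es"))   # ['perro']
print(translate("toe", "en", "cs"))   # None
```

No wordnet is linked directly to another one; every pair of languages is connected only through the shared record IDs.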
Balkanet
• Continuation of EWN, 2001-04; five more languages added:
Gr, Turk, Ro, Bg, S-Cr (Cz continues)
• New features in comparison with EWN: larger set of
BCs (up to 8000 synsets), more stress put on capturing
the differences between languages
• New data representation using XML format, serious
approach to standardization
• New tool: editor and browser VisDic (by FI MU)
• More attention has been paid to data validation;
in particular, the multilingual corpus of Orwell's 1984 has been
used for this purpose.
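A minimal sketch of what an XML representation of a synset could look like, using the "driver" synset from the PWN slide. The element names (SYNSET, ID, POS, SYNONYM, LITERAL, SENSE, DEF) and the function `synset_to_xml` are assumptions for illustration and should not be read as the actual Balkanet or VisDic schema.

```python
import xml.etree.ElementTree as ET

def synset_to_xml(synset_id, pos, literals, gloss):
    """Serialize one synset to an XML string (element names are assumed)."""
    root = ET.Element("SYNSET")
    ET.SubElement(root, "ID").text = synset_id
    ET.SubElement(root, "POS").text = pos
    synonym = ET.SubElement(root, "SYNONYM")
    for word, sense in literals:
        literal = ET.SubElement(synonym, "LITERAL")
        literal.text = word
        ET.SubElement(literal, "SENSE").text = str(sense)
    ET.SubElement(root, "DEF").text = gloss
    return ET.tostring(root, encoding="unicode")

xml_text = synset_to_xml(
    "ENG20-09277009-n", "n", [("driver", 1)], "the operator of a motor vehicle"
)

# Round trip: parse the string back and read individual fields.
parsed = ET.fromstring(xml_text)
print(parsed.findtext("ID"))                      # ENG20-09277009-n
print(parsed.find("SYNONYM/LITERAL").text)        # driver
print(parsed.findtext("SYNONYM/LITERAL/SENSE"))   # 1
```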
Czech WordNet
• Synsets so far do not contain Czech glosses (definitions);
the English ones are used for now, and Czech glosses will be added
• Verb synsets are supplemented with verb frames that are
associated with the individual senses
• Verb frames contain the surface valencies (cases) and also
deep valencies containing general semantic roles such
as AG, PAT, ADR, together with selectional
constraints exploiting literals taken directly from PWN 2.0,
for example: [{obléci, obléct, obléknout}
kdo1*AG(person:1)=co4*ART(garment:1)]
• There is an attempt to exploit the verb semantic classes
introduced by Levin (1993); for each verb the number of
its semantic class is given
• At the moment it includes approx. 3500 verbs (Cz, Eng).
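The frame notation in the example above can be taken apart with a short parser. This is a sketch under stated assumptions: the reading of each slot as pronoun + case number, `*`, semantic role, and a `literal:sense` constraint in parentheses, with `=` marking the verb position, is inferred from the single example on the slide; the names `SLOT` and `parse_frame` are hypothetical.

```python
import re

# One slot, e.g. "kdo1*AG(person:1)":
# pronoun, case number, '*', role, '(', PWN literal, ':', sense number, ')'
SLOT = re.compile(r"([a-z]+)(\d)\*([A-Z]+)\((\w+):(\d+)\)")

def parse_frame(frame):
    """Split a frame at '=' (assumed verb position) and parse each slot."""
    slots = []
    for part in frame.split("="):
        match = SLOT.fullmatch(part.strip())
        if match is None:
            raise ValueError(f"unrecognized slot: {part!r}")
        pronoun, case, role, literal, sense = match.groups()
        slots.append({
            "pronoun": pronoun,      # surface question word (kdo = who, co = what)
            "case": int(case),       # surface valency: Czech case number
            "role": role,            # deep valency: semantic role
            "constraint": (literal, int(sense)),  # selectional constraint
        })
    return slots

for slot in parse_frame("kdo1*AG(person:1)=co4*ART(garment:1)"):
    print(slot)
# {'pronoun': 'kdo', 'case': 1, 'role': 'AG', 'constraint': ('person', 1)}
# {'pronoun': 'co', 'case': 4, 'role': 'ART', 'constraint': ('garment', 1)}
```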
Tools
• VisDic – local tool with journaling, XML format;
basic unit: synset, which consists of literals
- main functions: browsing, editing, export, projection
• New tool DEB – client/server architecture, XML formats
- basic unit: literal, capturing relations between literals,
integration with the morphological analyzer Ajka, the corpus
manager Bonito, and other modules or resources
• Morphological module (for Czech) Ajka
• Interface SAFT – integrating Czech WordNet with the
morphological analyzer Ajka, making it possible to process free
(corpus) text
• Working on the integration with the partial parser
DIS/VADIS, it will be possible to exploit lexical
information and semantic features in WordNet during
Applications of WNs
• Machine Translation – WNs can be used as a new type
of dictionary thanks to synsets (synonymy relation and
H/H relation)
• IE – information extraction: makes it possible to follow semantic
relations in text and exploit multilinguality
• Useful with web browsers: query extension with synonyms and H/H
relations; experiments show improvement from approx.
13 % without WN to approx. 60 % (after query
extension, experiments for English only)
• Word Sense Discrimination – as a data resource for
sense recognition
• Knowledge representation, inference relying on word
meanings, relations to Semantic Web
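Query extension with synonyms and H/H relations, as mentioned in the list above, can be sketched as follows. The three lookup tables are toy data and `expand_query` is a hypothetical helper; a real system would read these relations from a wordnet instead.

```python
# Toy relation tables (illustrative only).
SYNONYMS = {"car": ["automobile", "auto"]}
HYPERNYMS = {"car": ["motor vehicle"]}
HYPONYMS = {"car": ["taxi"]}

def expand_query(terms):
    """Add synonyms, hypernyms and hyponyms of each term, keeping order
    and dropping duplicates."""
    expanded = []
    for term in terms:
        candidates = [
            term,
            *SYNONYMS.get(term, []),
            *HYPERNYMS.get(term, []),
            *HYPONYMS.get(term, []),
        ]
        for candidate in candidates:
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded

print(expand_query(["car", "rental"]))
# ['car', 'automobile', 'auto', 'motor vehicle', 'taxi', 'rental']
```

Terms without an entry in the tables, such as "rental" here, pass through unchanged.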
Literature
• G. A. Miller et al., Five Papers on WordNet, 1993,
rev. version, clarity.princeton.edu
• EuroWordNet, final report, CD-ROM with data, 1999,
www pages EWN, distributed by ELDA/ELRA (Paris)
• P. Vossen et al., EuroWordNet, book publ. by Kluwer
• www pages of the Global WordNet Association (GWA, P.
Vossen, Ch. Fellbaum)
• www pages of the Balkanet Project, Final report 2004
• www pages of the Second Global WordNet Conference,
Brno, 20.-23.1.2004
• www pages of the NLP Lab. FI MU in Brno, VisDic page
Top hyperonyms in WordNet 1.5
• act, action, activity (činnost, aktivita)
• animal, fauna (zvíře, fauna)
• artifact (výtvor, výrobek)
• attribute, property (atribut, vlastnost)
• body, corpus (tělo, těleso)
• cognition, knowledge (znalost, poznání)
• communication (komunikace, sdělování)
• event, happening (událost)
• feeling, emotion (pocit, emoce)
• food (potrava, jídlo)
• group, collection (skupina, soubor)
• location, place (umístění, místo)
• motive (motiv)
• natural object (fyzický objekt)
• natural phenomenon (přírodní jev)
• person, human being (osoba, lidská bytost)
• plant, flora (rostlina, flora)
• possession (vlastnictví)
• process (proces)
• quantity, amount (kvantita, množství)
• relation (vztah)
• shape (podoba, tvar)
• state, condition (stav)
• substance (substance, látka)
• time (čas)