Wordnet from A to Z

CVETANA KRSTEV
UNIVERSITY OF BELGRADE, FACULTY OF PHILOLOGY
DEPARTMENT OF LIBRARY AND INFORMATION SCIENCES
Outline of my talk
History
What is WordNet?
◦ A Concept vs. Lexical form
◦ Relations
◦ Practice
Development Projects
Usage
Enhancements
Wordnets in the world
What is WordNet?
What Wikipedia says about WordNet
WordNet is a lexical database for the English language.
It groups English words into sets of synonyms called synsets, provides short definitions and
usage examples, and records a number of relations among these synonym sets or their
members. WordNet can thus be seen as a combination of dictionary and thesaurus.
While it is accessible to human users via a web browser, its primary use is in automatic text
analysis and artificial intelligence applications. (?)
The database and software tools have been released under a BSD style license and are freely
available for download from the WordNet website. Both the lexicographic data (lexicographer
files) and the compiler (called grind) for producing the distributed database are available.
Authors of the (first) WordNet
WordNet was created in the Cognitive Science Laboratory of Princeton University
under the direction of psychology professor George Armitage Miller, starting in 1985,
and has been directed in recent years by Christiane Fellbaum.
◦ That is why it is usually called „the Princeton WordNet“ (PWN)
George Miller and Christiane Fellbaum were awarded the
2006 Antonio Zampolli Prize for their work with WordNet.
What do authors say about this resource?
Abstract:
◦ WordNet is an on-line lexical reference system whose design is inspired by current
psycholinguistic theories of human lexical memory. English nouns, verbs, and adjectives are
organized into synonym sets, each representing one underlying lexical concept. Different relations
link the synonym sets.
Miller, George A., et al. "Introduction to WordNet: An on-line lexical
database." International Journal of Lexicography 3.4 (1990): 235-244.
More details can be found in „5papers“:
http://wordnetcode.princeton.edu/5papers.pdf
What do authors say about this resource?
Summary:
◦ In the modern computer era, alphabetic search for words is not enough;
◦ „...however,... it is grossly inefficient to use these powerful machines as little more than rapid
page-turners.“
◦ „Beginning with word association studies at the turn of the century ..., psycholinguists have
discovered many synchronic properties of the mental lexicon that can be exploited in
lexicography.“
◦ „The initial idea was to provide an aid to use in searching dictionaries conceptually, rather
than merely alphabetically—it was to be used in close conjunction with an on-line dictionary
of the conventional type.“
Synset – the basic unit of WordNet
o Synset – synonym set;
o A synset is a representation of a concept – a definition is added only to
facilitate development and usage;
o To reduce ambiguity, discussions of WordNet avoid the bare term „word“: „word form“
or „literal“ refers to the physical utterance or superficial form, and „word meaning“
refers to the lexicalized concept that a form can be used to express.
o „These synonym sets (synsets) do not explain what the concepts are; they
merely signify that the concepts exist.“
A wordform – concept relation
o This relation is many-to-many
o Example:
o {board, plank}
- def: a stout length of sawn timber; made in a wide variety of sizes
and used for many purposes
o {board, table}
- def: food or meals in general; usage: „she sets a fine table“; „room
and board“
o A concept can be lexicalized by several word forms (one concept – two word
forms, board and plank)
o A word form can be used for lexicalization of several concepts (one word form
– board – can be used for two and many more concepts)
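The many-to-many relation above can be sketched as two lookup tables. This is a minimal illustration, not WordNet's actual storage format; the synset IDs are made up for the example.

```python
from collections import defaultdict

# Toy synsets from the example above. The IDs are hypothetical labels.
SYNSETS = {
    "board.timber": {"board", "plank"},
    "board.meals": {"board", "table"},
}


def build_index(synsets):
    """Invert synset -> word forms into word form -> synset IDs."""
    index = defaultdict(set)
    for synset_id, word_forms in synsets.items():
        for form in word_forms:
            index[form].add(synset_id)
    return dict(index)


INDEX = build_index(SYNSETS)

# One concept, several word forms:
print(sorted(SYNSETS["board.timber"]))
# One word form, several concepts:
print(sorted(INDEX["board"]))
```

Looking up a synset ID gives the word forms that lexicalize one concept; looking up a word form in the inverted index gives all concepts it can express.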
What are synonyms?
According to one definition two expressions are synonymous if the substitution of one for the
other never changes the truth value of a sentence in which the substitution is made.
By that definition, true synonyms are rare, if they exist at all.
A weakened version of this definition would make synonymy relative to a context: two
expressions are synonymous in a linguistic context C if the substitution of one for the other in C
does not alter the truth value.
For example, the substitution of plank for board will seldom alter truth values in carpentry
contexts, although there are other contexts of board where that substitution would be totally
inappropriate.
It is convenient to assume that the relation is symmetric: if x is similar to y, then y is equally
similar to x.
Partitioning of WordNet
o The definition of synonymy in terms of substitutability makes it necessary to
partition WordNet into nouns, verbs, adjectives, and adverbs.
o If concepts are represented by synsets, and if synonyms must be
interchangeable, then words in different syntactic categories cannot be
synonyms (cannot form synsets) because they are not interchangeable.
o Nouns express nominal concepts, verbs express verbal concepts, and
modifiers provide ways to qualify those concepts.
o The use of synsets to represent word meanings is consistent with
psycholinguistic evidence that nouns, verbs, and modifiers are organized
independently in semantic memory.
Other relations - antonymy
o The antonym of a word x is sometimes not-x, but not always. For example, rich and poor are
antonyms, but to say that someone is not rich does not imply that they must be poor.
o Antonymy is a symmetric relation;
o Antonymy is a lexical relation between word forms, not a semantic relation between word
meanings.
o Example:
◦ the meanings {rise, ascend} and {fall, descend} are conceptual opposites, but they are not antonyms;
◦ rise/fall and ascend/descend are antonyms
◦ but most people would reject rise and descend, or ascend and fall, as antonyms
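A small sketch of what it means for antonymy to hold between word forms rather than between synsets. The data structure is an assumption for illustration only.

```python
# Synsets as sets of word forms (from the example above).
RISE = frozenset({"rise", "ascend"})
FALL = frozenset({"fall", "descend"})

# Antonymy is recorded between individual word forms, not between synsets.
ANTONYM_PAIRS = {("rise", "fall"), ("ascend", "descend")}


def are_antonyms(a, b):
    """Symmetric lookup: (a, b) and (b, a) count as the same pair."""
    return (a, b) in ANTONYM_PAIRS or (b, a) in ANTONYM_PAIRS


print(are_antonyms("fall", "rise"))
print(are_antonyms("rise", "descend"))
```

If antonymy were stored between the two synsets, every cross pair (rise/descend, ascend/fall) would wrongly count as antonyms.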
Hyponymy/hypernymy (1)
o Also called subordination/superordination, subset/superset, or the ISA
relation
o hyponymy/hypernymy is a semantic relation between word meanings, not a
lexical relation between word forms.
o Example:
◦ {maple} is a hyponym of {tree}, and {tree} is a hyponym of {plant}
o A concept represented by the synset {x, x′,...} is a hyponym of the concept
represented by the synset {y, y′,...} if one can say (in English) „An x is a (kind
of) y“.
Hyponymy/hypernymy (2)
o Hyponymy is transitive and asymmetrical, and, since there is normally a single superordinate,
it generates a hierarchical semantic structure, in which a hyponym is said to be below its
superordinate.
o A hyponym inherits all the features of the more generic concept and adds at least one
feature that distinguishes it from its superordinate and from any other hyponyms of that
superordinate
o Example:
◦ maple inherits the features of its superordinate, tree, but is distinguished from other trees by the
hardness of its wood, the shape of its leaves, the use of its sap for syrup, etc.
o This relation is the central organizing principle for the nouns in WordNet, and also for
verbs, but the noun hierarchy is much deeper.
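Transitivity and asymmetry can be sketched with a single-parent table, assuming (as stated above) that each synset normally has one superordinate. Synsets are abbreviated to one word form each.

```python
# Each synset points to its single superordinate (hypernym).
HYPERNYM = {"maple": "tree", "tree": "plant"}


def hypernym_chain(synset):
    """Follow the hypernym links upward and return all superordinates in order."""
    chain = []
    while synset in HYPERNYM:
        synset = HYPERNYM[synset]
        chain.append(synset)
    return chain


def is_hyponym_of(x, y):
    """True if y is reachable from x by following hypernym links (transitive)."""
    return y in hypernym_chain(x)


print(hypernym_chain("maple"))
print(is_hyponym_of("maple", "plant"))
print(is_hyponym_of("plant", "maple"))
```

The chain walk makes the relation transitive (maple is a kind of plant) while the one-directional links keep it asymmetric.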
Meronymy/holonymy (1)
o Also called the part-whole or HASA relation
o A concept represented by the synset {x, x′,...} is a meronym of a concept represented by the
synset {y, y′,...} if one can say (in English) that „A y has an x (as a part)“ or „An x is a part of y“.
o The meronymic relation is transitive (with qualifications) and asymmetrical
o It can be used to construct a part hierarchy
o Example:
◦ {mouth, muzzle} is a meronym of {face, countenance}
◦ {wheel} is a meronym of {wheeled vehicle} (not of {vehicle}, because there are vehicles without wheels;
parts are not inherited “upward”)
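The wheel example can be sketched as follows. The sketch assumes that parts behave like other inherited features, so a hyponym receives the parts of its hypernyms but a hypernym never receives parts from below. The synset {car} is a hypothetical addition used only to show downward inheritance.

```python
# Hypernym links; {car} is a hypothetical hyponym added for illustration.
HYPERNYM = {"car": "wheeled vehicle", "wheeled vehicle": "vehicle"}

# Parts declared directly on a synset.
PARTS = {"wheeled vehicle": {"wheel"}}


def parts_of(synset):
    """Parts declared on the synset itself plus those declared on its hypernyms."""
    parts = set()
    while synset is not None:
        parts |= PARTS.get(synset, set())
        synset = HYPERNYM.get(synset)
    return parts


print(parts_of("car"))
print(parts_of("vehicle"))
```

The lookup only walks upward through hypernyms, so {vehicle} never picks up {wheel} from {wheeled vehicle}.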
WordNet in practice – Princeton WordNet
o Example of one noun synset:
o Synset
◦ {dog, domestic_dog, Canis_familiaris}
o Definition
◦ a member of the genus Canis (probably descended from the common wolf) that has been
domesticated by man since prehistoric times; occurs in many breeds;
o Usage
◦ "the dog barked all night"
Dog – upward hierarchy
{entity}
{physical_entity}
{object, physical_object}
{whole, unit}
{living_thing, animate_thing}
{organism, being}
{animal, animate_being, beast, brute, creature, fauna}
{chordate}
{domestic_animal, domesticated_animal}
{vertebrate, craniate}
{mammal, mammalian}
{placental, placental_mammal, eutherian, eutherian_mammal}
{carnivore}
{canine, canid}
{dog, domestic_dog, Canis_familiaris}
Dog – downward hierarchy
{dog, domestic_dog, Canis_familiaris}
{puppy_dog}
{hunting_dog}
{hound, hound_dog}
{basset, basset_hound}
{working_dog}
...
25 unique beginners for noun synsets
{act, action, activity}
{food}
{possession}
{animal, fauna}
{location, place}
{process}
{artifact}
{motive}
{quantity, amount}
{attribute, property}
{group, collection}
{relation}
{body, corpus}
{natural object}
{shape}
{cognition, knowledge}
{natural phenomenon}
{state, condition}
{communication}
{person, human being}
{substance}
{event, happening}
{plant, flora}
{time}
{feeling, emotion}
Organization of top levels
[Diagram of the top levels of the noun hierarchy. Node labels: entity; physical entity;
abstraction; physical object; location; process; substance; psychological_feature;
attribute; relation; state; shape; whole, unit; natural object; living thing; artifact;
organism; person; animal; plant.]
Dog – additional relations
o MemberHolonym
◦ {Canis, genus_Canis}
◦ Def: type genus of the Canidae: domestic and wild dogs; wolves; jackals
◦ {pack} (dog is a member of a pack)
◦ Def: a group of hunting animals
o PartMeronym
◦ {flag} (flag is a part of a dog)
◦ Def: a conspicuously marked or shaped tail
Meronymy/holonymy (2)
Three types of meronymy/holonymy relation:
◦ PartHolonym (mouse button is a part of a computer mouse)
◦ {mouse, computer_mouse} (def: a hand-operated electronic device that controls the coordinates of a
cursor...)
◦ {mouse_button} (Def: a push button on the mouse)
◦ MemberHolonym (a rodent is a member of Rodentia)
◦ {rodent, gnawer} (def: relatively small placental mammals having a single pair of constantly growing
incisor...)
◦ {Rodentia, order_Rodentia} (def: small gnawing animals: porcupines; rats; mice; squirrels; marmots;
beavers; gophers; ...)
◦ SubstanceHolonym (protein is a substance of milk)
◦ {protein} (def: any of a large group of nitrogenous organic compounds that are essential constituents of
living beings)
◦ {milk} (def: a white nutritious liquid secreted by mammals and used as food by human beings)
Antonymy – in different parts of speech
Verbs
◦ {open, open_up}
◦ def: cause to open or to become open;
◦ Antonym: {close, shut}
◦ def: move so that an opening or passage is obstructed; make shut;
Nouns
◦ {sadness, unhappiness}
◦ def: emotions experienced when not in a state of well-being
◦ Antonym: {joy, joyousness, joyfulness}
◦ def: the emotion of great happiness
Adjectives
◦ {ugly}
◦ def: displeasing to the senses
◦ Antonym: {beautiful}
◦ def: delighting the senses or exciting intellectual or emotional admiration;
To fry – (shallow) hierarchy
{fry}: cook on a hot surface using fat; "fry the pancakes"
{change}
{change_integrity}
{cook}
{fry}
{frizzle}
{deep-fat-fry}
{pan-fry}
...
Verb clusters
Verbs of Bodily Functions and Care (sweat)
Motion Verbs (move)
Verbs of Change (change)
Emotion or Psych Verbs (feel)
Verbs of Communication (tell)
Stative Verbs (have, wear)
Competition Verbs (race)
Perception Verbs (see)
Consumption Verbs (drink)
Verbs of Possession (possess, own)
Contact Verbs (touch)
Verbs of Social Interaction (request, impeach)
Cognition Verbs (think)
Weather Verbs (thunder)
Creation Verbs (create)
Other verb relations
Cause (1)
◦ {burn, combust}
◦ def: cause to burn or combust;
◦ Usage: "The sun burned off the fog"; "We combust coal and other fossil fuels"
◦ {burn, combust}
◦ def: undergo combustion;
◦ Usage: "Maple wood burns well"
Cause (2)
◦ {feed, give}
◦ Def: give food to
◦ Usage: "Feed the starving children in India";
◦ {eat}
◦ Def: take in solid food;
◦ Usage: "She was eating a banana"
Other verb relations (2)
Entailment - the relation between two verbs V1 and V2 that
holds when the sentence Someone V1 logically entails the
sentence Someone V2
◦ {abort}: terminate a pregnancy by undergoing an abortion
entails
◦ {conceive}: become pregnant; undergo conception
◦ {snore, saw_wood, saw_logs}: breathe noisily during one's sleep
entails
◦ {sleep, kip, slumber, log_Z's, catch_some_Z's}: be asleep
Other relations
Cross-Part-Of-Speech
◦ Attribute:
◦ adjective {perfect} – noun {perfection, flawlessness, ne_plus_ultra}
◦ Adjective {clean} – noun {cleanness}
◦ Derivationally related:
◦ Verb {abort} – noun {abortion}
◦ Adjective {dirty, soiled, unclean} - noun {dirtiness, uncleanness}
◦ Similar (all parts of speech)
◦ Adjective {dirty, soiled, unclean} - {unwashed}, {sooty}, {maculate}, {greasy, oily}...
◦ SeeAlso (all parts of speech)
◦ Adjective {dirty, soiled, unclean} - {untidy}
TopicDomain
{cooking, cookery, preparation}: the act of preparing something (as food) by the
application of heat
◦ Verb {fry}: cook on a hot surface using fat
◦ Noun {curry}: (East Indian cookery) a pungent dish of vegetables or meats flavored with curry
powder and usually eaten with rice
{sport, athletics}: an active diversion requiring physical exertion and
competition
◦ Adjective {loose}: (of a ball in sport) not in the possession or control of any player
◦ Noun {offside}: (sport) the mistake of occupying an illegal position on the playing field (in
football, soccer, ice hockey, field hockey, etc.)
◦ Verb {shoot}: score; "shoot a basket"; "shoot a goal"
InstanceHyponym
{athlete, jock}
{player, participant}
{tennis_player}
{receiver}
{tennis_pro, professional_tennis_player}
{Evert, Chris_Evert, Chrissie_Evert, Christine_Marie_Evert}
{King, Billie_Jean_King, Billie_Jean_Moffitt_King}
{Navratilova, Martina_Navratilova}
{Seles, Monica_Seles}
Novak Đoković ?
WordNet 3.0 statistics (according to Piek Vossen, VU University Amsterdam)
POS | Unique strings | Synsets | Word-Sense Pairs
Noun | 117,798 | 82,115 | 146,312
Verb | 11,529 | 13,767 | 25,047
Adjective | 21,479 | 18,156 | 30,002
Adverb | 4,481 | 3,621 | 5,580
Total | 155,287 | 117,659 | 206,941
Projects
EuroWordNet
(project: March 1996 – June 1999)
o EuroWordNet is a multilingual database with wordnets for several European
languages (Dutch, Italian, Spanish, German, French, Czech and Estonian).
o The wordnets are structured in the same way as the American wordnet for English in
terms of synsets (sets of synonymous words) with basic semantic relations between
them.
o Each wordnet represents a unique language-internal system of lexicalizations.
o In addition, the wordnets are linked to an Inter-Lingual-Index, based on the Princeton
wordnet. Via this index, the languages are interconnected so that it is possible to go
from the words in one language to similar words in any other language.
o The index also gives access to a shared top-ontology of 63 semantic distinctions. This
top-ontology provides a common semantic framework for all the languages
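The role of the Inter-Lingual-Index can be sketched as a shared key that each language-specific wordnet maps its synsets to. The record ID and the tiny word lists below are toy entries chosen for illustration; they are not actual EuroWordNet data.

```python
# Hypothetical ILI record ID mapped to synsets in several toy wordnets.
WORDNETS = {
    "en": {"ili-dog": ["dog"]},
    "nl": {"ili-dog": ["hond"]},
    "es": {"ili-dog": ["perro"]},
    "it": {"ili-dog": ["cane"]},
}


def ili_ids(word, lang):
    """Return the ILI records that a word form is linked to in one language."""
    return [ili for ili, words in WORDNETS[lang].items() if word in words]


def translate(word, source, target):
    """Go from a word in the source language to similar words in the target language."""
    result = []
    for ili in ili_ids(word, source):
        result.extend(WORDNETS[target].get(ili, []))
    return result


print(translate("hond", "nl", "es"))
```

No language pair is linked directly; every lookup passes through the shared index, which is why adding one more wordnet connects it to all the others at once.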
Vossen, P. "From WordNet to EuroWordNet to the Global WordNet Grid: anchoring
languages to universal meaning." Guest lecture, Language Engineering Applications,
February, 26th (2009).
Multilingual Balkan Wordnet
IST-2000-29388 [September 2001 – August 2004]
The project consortium consisted of 13
institutions from:
Bulgaria
Greece
Romania
Serbia
Turkey
France
The Netherlands
Czech Republic
http://www.dblab.upatras.gr/balkanet/index.htm
The aims of the BalkaNet project
o The development of the multilingual resources for the Balkan
languages (Bulgarian, Greek, Romanian, Serbian, Turkish, and Czech)
o The enhancement of the semantic network EuroWordNet
o The definition of Balkan specific concepts
o The integration of semantic networks into applications based on
natural language processing (e.g. classification of web documents)
Development models
There are two main models for building a multilingual wordnet:
◦ A merge model consists of building a language specific wordnet independently
from other wordnets (and from PWN)
◦ Used in EuroWordNet (in a second phase the correspondences between individual wordnets
were established) and in Polish Wordnet (plWordNet 2.0)
◦ An expand model (translation-based model) consists of building a language
specific wordnet keeping as much as possible of the semantic relations
available in PWN. This is done by building the new synsets in correspondence
with the PWN synsets, whenever possible, and importing semantic relations
from the corresponding English synsets;
◦ Used in the BalkaNet project and many other projects
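A minimal sketch of the expand model, assuming synsets are identified by a PWN key and that a lexicographer supplies the translations. The Spanish word forms and the three-synset fragment are illustrative only; a relation is imported only when both of its ends already exist in the new wordnet.

```python
# Fragment of PWN hypernym links (child -> parent), abbreviated to one key each.
PWN_HYPERNYM = {"dog": "canine", "canine": "carnivore"}

# Translations provided so far; {carnivore} has not been translated yet.
TRANSLATIONS = {"dog": ["perro"], "canine": ["cánido"]}


def expand(translations, pwn_hypernym):
    """Build new synsets aligned with PWN and import the relations that can be kept."""
    synsets = dict(translations)
    hypernym = {
        child: parent
        for child, parent in pwn_hypernym.items()
        if child in synsets and parent in synsets
    }
    return synsets, hypernym


NEW_SYNSETS, NEW_HYPERNYM = expand(TRANSLATIONS, PWN_HYPERNYM)
print(NEW_HYPERNYM)
```

In the merge model, by contrast, the hypernym table would be built independently and aligned with PWN only afterwards.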
Balkan specific concepts
o a concept specific for a particular Balkan language (стара
штедња ‘foreign currency saving accounts frozen by factual
bankruptcy’ for Serbian),
o a concept originating from one Balkan language which has spread
to other Balkan and European languages (Атентат у Сарајеву
‘the assassination in Sarajevo’),
o a concept which is not necessarily specific for the Balkans only, but
which is recognized as common in this area, while at the same
time it has not been registered in PWN (пирамидална банка
‘banks offering extremely high interest rates’).
Concepts recognized by all Balkan languages
Bulgarian | Greek | Romanian | Serbian | Turkish
кадаиф | κανταΐφι | cataif | кадаиф | kadayıf
халва | χαλβάς | halva | алва | kağıt helva
Enhancements
WordNet Domains Hierarchy
The WordNet Domains Hierarchy (WDH) is a language-independent
resource composed of 164 hierarchically organized domain labels (e.g.
Architecture, Sport, Medicine).
WordNet Domains is a lexical resource developed at ITC-irst where each
WordNet synset is annotated with one or more domain labels selected
from a domain hierarchy which was specifically created for this purpose.
The first version of the WDH was composed of 164 domain labels selected
starting from the subject field codes used in current dictionaries, and the
subject codes contained in the Dewey Decimal Classification (DDC), a
general knowledge organization tool which is the most widely used
taxonomy for library organization purposes.
More info: http://wndomains.fbk.eu/index.html
One of the five main trees in the WordNet Domains original hierarchy
The other four trees are:
free_time
applied_science
pure_science
social_science
The label FACTOTUM was assigned when no other label could be assigned.
One „word“ – many labels (domains) – example board
Synset definition | domain
a flat portable surface (usually rectangular) designed for board games | play
a printed circuit that can be inserted into expansion slots in a computer to increase ... | computer science
electrical device consisting of a flat insulated surface that contains switches ... | electronics
a table at which meals are served | furniture
a vertical surface on which information can be displayed to public view | electronics
food or meals in general | food
a flat piece of material designed for a special purpose | factotum
a stout length of sawn timber; made in a wide variety of sizes and used for many purposes | buildings
a committee having supervisory powers | administration
SUMO – The Suggested Upper Merged Ontology
o An ontology is a set of definitions in a formal language for terms describing
the world.
o An Upper Ontology is an attempt to capture the most general and reusable
terms and definitions.
o SUMO:
◦ 1000 terms, 4000 axioms (assertions), 750 rules;
◦ Mapped by hand to all of WordNet 1.6;
◦ then ported to newer versions
◦ Associated domain ontologies totaling 20,000 terms and 60,000 axioms;
o Free
◦ SUMO is owned by IEEE but basically public domain
◦ Domain ontologies are released under GNU
◦ www.ontologyportal.org
Adam Pease
Articulate Software
Presented at PANL10n
Relations between SUMO concepts and Wordnet Synsets
o Synonymy
◦ {battle, conflict, fight, engagement} -> SUMO Battle= (Domain: history)
o Subordination
◦ {naval_battle} -> SUMO Battle+ (Domain: history)
o Instance
◦ {Trafalgar, battle_of_Trafalgar} -> SUMO Battle@ (Domain: history)
o Less straightforward
◦ {writer, author} -> SUMO authors= (Domain: literature)
◦ {dramatist, playwright} -> SUMO Position+ (Domain: literature)
◦ {poet} -> SUMO authors+ (Domain: literature)
◦ {Brecht, Bertolt_Brecht} -> SUMO Man@ (Domain: literature)
Wordnet to SUMO Mapping and SUMO formalism
{plant, flora, plant_life}: (botany) a living organism lacking the power of
locomotion
SUMO: Plant= (Domain: biology)
SUMO has axioms that explain formally what a plant is:
(=>
  (and
    (instance ?SUBSTANCE PlantSubstance)
    (instance ?PLANT Organism)
    (part ?SUBSTANCE ?PLANT))
  (instance ?PLANT Plant))
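To show how such an axiom can be read operationally, here is a small Python sketch that applies this one rule to a set of facts. It is not a SUMO reasoner; the individuals Sap1, Maple1 and Dog1 are hypothetical, and facts are plain tuples.

```python
# Facts as (predicate, arg1, arg2) tuples; all individuals are hypothetical.
FACTS = {
    ("instance", "Sap1", "PlantSubstance"),
    ("instance", "Maple1", "Organism"),
    ("part", "Sap1", "Maple1"),
    ("instance", "Dog1", "Organism"),
}


def infer_plants(facts):
    """Apply the rule once: an Organism with a PlantSubstance part is a Plant."""
    inferred = set()
    for fact in facts:
        if fact[0] != "part":
            continue
        _, substance, whole = fact
        if (("instance", substance, "PlantSubstance") in facts
                and ("instance", whole, "Organism") in facts):
            inferred.add(("instance", whole, "Plant"))
    return inferred


print(infer_plants(FACTS))
```

Dog1 is an Organism but has no PlantSubstance part among the facts, so the rule does not fire for it.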
Why are SUMO and WordNet important
o Semantic word sense disambiguation
o “The board approved the pay raise.”
o Piece of wood, or corporate government?
o Anaphoric resolution
o “Betty saw Susan asleep on the couch. She put her to bed.”
o Sleeping people do not perform intentional actions
SentiWordNet
SentiWordNet is a lexical resource explicitly devised for supporting sentiment
classification and opinion mining applications.
SentiWordNet is the result of automatically annotating all WordNet synsets
according to their degrees of positivity, negativity, and neutrality.
Each synset s is associated with three numerical scores Pos(s), Neg(s), and Obj(s)
which indicate how positive, negative, and “objective” (i.e., neutral) the terms
contained in the synset are.
Each of the three scores ranges in the interval [0.0, 1.0], and their sum is 1.0 for
each synset.
SentiWordNet
o Different senses of the same term may have different opinion-related
properties.
o Example for the adjective estimable from SentiWordNet 1.0:
o {computable, estimable} def: may be computed or estimated Pos=0, Neg=0,
Obj=1.
o {estimable} def: deserving of respect or high regard Pos=0.75, Neg=0.0,
Obj=0.25.
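The score constraints and the sense-level difference can be checked with a short sketch. The tuple layout and the `polarity` helper are assumptions made for illustration; they are not part of SentiWordNet itself.

```python
# (Pos, Neg, Obj) scores for the two senses of "estimable" quoted above.
ESTIMABLE = {
    "may be computed or estimated": (0.0, 0.0, 1.0),
    "deserving of respect or high regard": (0.75, 0.0, 0.25),
}


def is_valid(scores, tol=1e-9):
    """Each score lies in [0.0, 1.0] and the three scores sum to 1.0."""
    in_range = all(0.0 <= s <= 1.0 for s in scores)
    return in_range and abs(sum(scores) - 1.0) <= tol


def polarity(scores):
    """Pos(s) minus Neg(s): a hypothetical convenience measure."""
    pos, neg, _ = scores
    return pos - neg


for gloss, scores in ESTIMABLE.items():
    print(gloss, is_valid(scores), polarity(scores))
```

Because scores are attached to synsets, a classifier has to pick the sense of a term before it can use the sentiment value.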
Usage
Software
o EuroWordNet - Polaris: a wordnet editing tool for creating, editing and
exporting wordnets
o BalkaNet – VisDic: XML-based WordNet editor
o DEBVisDic: a client-server application that was used for the editing of several
WordNets (Dutch in the Cornetto project, Polish, Hungarian, several African
languages, Chinese)
o Many research teams have developed their own development software
o Example: for Serbian – SWNE http://sm.jerteh.rs/Default.aspx hosted by JeRTeh, Society for
Language Resources and Technologies (Serbia)
Usage of wordnets
o Improve recall of text-based analysis:
o Query → Index
◦ Synonyms: commence → begin
◦ Hypernyms: taxi → car
◦ Hyponyms: car → taxi
◦ Meronyms: trunk → elephant
◦ Lexical entailments: used a gun → shot
o Inferencing:
o what things can be used for transport?
o Expressions in language generation and translation:
o alternative words and paraphrases
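Recall improvement through query expansion can be sketched as follows. The relation tables hold only the query → index pairs listed above, and exact term matching stands in for a real index; both are simplifying assumptions.

```python
# Toy relation tables built from the query -> index pairs above.
SYNONYMS = {"commence": {"begin"}}
HYPERNYMS = {"taxi": {"car"}}
HYPONYMS = {"car": {"taxi"}}


def expand_query(terms):
    """Add wordnet neighbours of each query term to the query."""
    expanded = set(terms)
    for term in terms:
        for relation in (SYNONYMS, HYPERNYMS, HYPONYMS):
            expanded |= relation.get(term, set())
    return expanded


def matches(query_terms, document_terms):
    """True if the expanded query shares at least one term with the document."""
    return bool(expand_query(query_terms) & set(document_terms))


print(sorted(expand_query(["commence"])))
print(matches(["taxi"], ["the", "car", "stopped"]))
```

Without expansion the query "taxi" would miss a document that only mentions "car"; expansion trades some precision for that extra recall.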
Recall improvement
Improvement of web search
◦ For Serbian VebRanka (http://hlt.rgf.bg.ac.rs/VeBranka/About.aspx?param=1)
Anaphora resolution:
◦ The girl fell off the table. She... / The glass fell off the table. It...
Coreference resolution:
◦ When he moved the furniture, the antique table got damaged.
◦ The young puppy damaged the furniture. The pet felt at home.
Summarizers:
◦ Sentence selection based on word counts → concept counts
Named entity types: detect locations, organizations, people, etc.
Other usages
o Data sparseness for machine learning: hapaxes can be replaced by
semantic classes
o Use redundancy for more robustness: spelling correction and
speech recognition can build semantic expectations using WordNet
and make better choices
o Sentiment and opinion mining, sentiment classification
o For Serbian (SAFOS)
o Vocabulary learning
Wordnets in the World
Global WordNet
Global WordNet Association - http://globalwordnet.org/
o A free, public and non-commercial organization that provides a platform for
discussing, sharing and connecting wordnets for all languages in the world.
o Organizes GWA Conferences – 8 conferences so far
o Global WordNet Grid - which is being built around a shared set of concepts
used in many wordnet projects.
o List of all wordnets in the world (contact persons, licences, etc.):
http://globalwordnet.org/wordnets-in-the-world/