Building Wordnets, by Piek Vossen


Building Wordnets
Piek Vossen, Irion Technologies
[email protected]
Overview
- Starting points
- Semantic framework
- Process overview
- Methodologies in other projects
- Multilinguality
Starting points
- Purpose of the wordnet database:
  - education, science, applications
  - formal ontology or linguistic ontology
  - making inferences or lexical substitution
  - conceptual density or large coverage
- Distributed development
- Reproducibility
- Available resources
- Language-specific features
- (Cross-language) compatibility
- Exploit community resources by projecting conceptual relations on a target wordnet
Semantic framework

Differences in wordnet structures
[Figure: two hierarchy fragments side by side.
Wordnet1.5 nodes: object; artifact, artefact (a man-made object); natural object (an object occurring naturally); instrumentality; implement; device; container; tool; instrument; block; body; box; spoon; bag.
Dutch Wordnet nodes: voorwerp {object}; werktuig {tool}; lichaam {body}; blok {block}; bak {box}; lepel {spoon}; tas {bag}.]
- Artificial Classes versus Lexicalized Classes: instrumentality; natural object
- Lexicalization differences of classes: container and artifact (object) are not lexicalized in Dutch
Linguistic versus conceptual ontologies
- Conceptual ontology:
  - A particular level or structuring may be required to achieve better control or performance, or a more compact and coherent structure.
  - Introduce artificial levels for concepts which are not lexicalized in a language (e.g. instrumentality, hand tool).
  - Neglect levels which are lexicalized but not relevant for the purpose of the ontology (e.g. tableware, silverware, merchandise).
  - What properties can we infer for spoons?
    spoon -> container; artifact; hand tool; object; made of metal or plastic; for eating, pouring or cooking
- Linguistic ontology:
  - Exactly reflects the relations between all the lexicalized words and expressions in a language.
  - Valuable information about the lexical capacity of languages: what is the available fund of words and expressions in a language.
  - What words can be used to name spoons?
    spoon -> object, tableware, silverware, merchandise, cutlery,
Wordnets as Linguistic Ontologies
Main purpose is to predict what words can be used as substitutes in language,
considering all the lexicalized words in a language.
Classical Substitution Principle:
Any word that is used to refer to something can be replaced by its synonyms, hyperonyms and hyponyms:
horse -> stallion, mare, pony, mammal, animal, being.
It cannot be referred to by co-hyponyms and co-hyponyms of its hyperonyms:
horse -X-> cat, dog, camel, fish, plant, person, object.
Conceptual Distance Measurement:
Number of hierarchical nodes between words is a measurement of closeness,
where the level and the local density of nodes are additional factors.
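The substitution principle and the distance measure can be sketched on a toy hierarchy. This is a minimal illustration, not the method of any particular wordnet tool: the words come from the slide, but the tree shape, the function names and the choice to count edges (rather than intermediate nodes, and without weighting by level or local density) are assumptions. Synonyms are left out because each node stands for a whole synset.

```python
# Toy hypernym hierarchy (child -> parent). The words are taken from the
# slide; the exact tree shape is an illustrative assumption.
HYPERNYM = {
    "stallion": "horse", "mare": "horse", "pony": "horse",
    "horse": "mammal", "cat": "mammal", "dog": "mammal", "camel": "mammal",
    "mammal": "animal", "fish": "animal",
    "animal": "being", "plant": "being", "person": "being",
}


def hypernyms(word):
    """Return all ancestors of word, nearest first."""
    chain = []
    while word in HYPERNYM:
        word = HYPERNYM[word]
        chain.append(word)
    return chain


def hyponyms(word):
    """Return all descendants of word."""
    found = []
    for child, parent in HYPERNYM.items():
        if parent == word:
            found.append(child)
            found.extend(hyponyms(child))
    return found


def substitutes(word):
    """Words allowed by the substitution principle: hyperonyms and hyponyms."""
    return set(hypernyms(word)) | set(hyponyms(word))


def distance(a, b):
    """Number of edges on the path from a to b through their nearest shared ancestor."""
    up_a = [a] + hypernyms(a)
    up_b = [b] + hypernyms(b)
    common = [node for node in up_a if node in up_b]
    if not common:
        return None  # no shared ancestor in this toy hierarchy
    return up_a.index(common[0]) + up_b.index(common[0])
```

With this hierarchy, `substitutes("horse")` contains stallion, mare, pony, mammal, animal and being, while co-hyponyms such as cat and dog are excluded.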
Define a semantic framework
- Definition of relations
  - Diagnostic frames (Cruse 1986)
  - Examples and corpus data
- Top-level ontology
  - Constraints on relations
  - Type consistency
- Large scale validation
Process overview
Techniques
- Manual encoding and verification
- Automatic extraction:
  - definitions
  - synonyms
  - distribution and similarity patterns in corpora
  - defining contexts, e.g. “cats and other pets”
  - parallel corpora, e.g. Bible translations
  - morphological structure
  - bilingual dictionaries
- Encode source and status of data:
  - who, when, based on what algorithm, validated, final
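As an illustration of two of these techniques together, the sketch below extracts hyponymy candidates from a defining context of the form "X and other Y" and stores each candidate with its source and status. The record fields, the pattern label and the status values are assumptions made for this example; a real extractor would also need lemmatization and multi-word handling.

```python
import re
from dataclasses import dataclass
from datetime import date


@dataclass
class Candidate:
    hyponym: str
    hypernym: str
    source: str    # based on what algorithm
    encoder: str   # who
    created: str   # when (ISO date)
    status: str    # e.g. "unvalidated", "validated", "final"


# Defining context "X and other Y" suggests that X is a kind of Y.
PATTERN = re.compile(r"\b(\w+) and other (\w+)\b")


def extract(text, encoder="auto", today=None):
    """Return unvalidated hyponymy candidates found in text."""
    today = today or date.today().isoformat()
    return [
        Candidate(m.group(1), m.group(2), "pattern:X and other Y",
                  encoder, today, "unvalidated")
        for m in PATTERN.finditer(text)
    ]
```

Because every candidate starts as "unvalidated", manual verification can later promote it to "validated" or "final" without losing track of where it came from.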
Encoding cycle
1. Collecting data:
  - Vocabulary: what is the list of words of a language?
  - Concepts: what is the list of concepts related to the vocabulary?
2. Encoding data:
  - Defining synsets
  - Defining language-internal relations: hyponymy, meronymy, roles, causal relations
  - Defining equivalence relations to English
  - Defining other relations, e.g. ontology types, domains
3. Validation
4. Go to 1.
Where to start?
- How to get a first selection:
  - Words (alphabetic, frequency) -> concepts -> relations
  - Concept (hyperonym, domain, semantic feature) -> words -> concepts -> relations
- How to get a complete overview of words and expressions that belong to a segment of a wordnet?
  - Up to 20 hyperonyms for instrumentality: instrument, instrumentality, means, tool, device, machine, apparatus, ...
  - iterative process: collect, structure, collect, restructure...
  - using multiple sources of evidence
  - comparing results, e.g. tricycle is a toy or a vehicle
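Comparing results from multiple sources can be sketched as a merge that keeps track of which source proposed which hyperonym and flags disagreements for manual review. The source names and the input format below are hypothetical.

```python
def merge_evidence(sources):
    """sources: {source_name: {word: hyperonym}}
    returns:  {word: {hyperonym: [source names proposing it]}}"""
    merged = {}
    for name, pairs in sources.items():
        for word, hyperonym in pairs.items():
            merged.setdefault(word, {}).setdefault(hyperonym, []).append(name)
    return merged


def conflicts(merged):
    """Words for which the sources propose more than one hyperonym."""
    return {word: hypers for word, hypers in merged.items() if len(hypers) > 1}


# Hypothetical evidence from a dictionary definition and a corpus pattern.
evidence = {
    "dictionary": {"tricycle": "vehicle", "spoon": "cutlery"},
    "corpus": {"tricycle": "toy", "spoon": "cutlery"},
}
to_review = conflicts(merge_evidence(evidence))
```

Here only tricycle ends up in `to_review`, since the two sources agree on spoon.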
Synonymy as a basis?
- Synsets are the core unit of a wordnet database
- Synonymy is only vaguely defined: substitution in a context.
- Synonyms are very hard to detect
- Other relations (role relations, causal relations):
  - easier to detect and encode
  - easier to validate within a formal framework
  - easier to validate in a corpus
- A rich set of relations per concept helps alignment with other resources
Diagnostic frames and examples
Agent Involvement
(A/an) X is the one/that who/which does the Y, typically intentionally.
Conditions:
- X is a noun
- Y is a verb in the gerundive form
Example:
A teacher is the one who does the teaching intentionally
Effect:
{to teach} (Y) INVOLVED_AGENT {teacher} (X)
Patient Involvement
(A/an) X is the one/that who/which undergoes the Y
Conditions:
- X is a noun
- Y is a verb in the gerundive form
Example:
A learner is the one who undergoes the learning
Effect:
{to learn} (Y) INVOLVED_PATIENT {learner} (X)
Diagnostic frames and examples
Result Involvement
(A/an) X comes into existence as a result of Y, where X is a noun and Y is a verb in the gerundive form and a hyponym of “make”, “produce”, “generate”.
Example:
A crystal comes into existence as a result of crystalizing
A crystal is the result of crystalizing
A crystal is created by crystalizing
Effect:
{to crystalize} (Y) INVOLVED_RESULT {crystal} (X)
Comments:
- Special kind of patient relation. The entity is not just changed or affected but it comes into existence as a result of the event.
- Only applies to concrete entities (1stOrder) or mental objects such as ideas (3rdOrder).
- Situations that result from other situations are related by the CAUSE relation.
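The diagnostic frames lend themselves to a small template mechanism: fill the frame to get a test sentence for an informant, and record the relation only if the sentence is accepted. The sketch below is an assumption about how this could be organized; the function names are hypothetical, the gerund is passed in by hand, and the article is always "A" for simplicity.

```python
# One template per relation, following the frames on the slides.
FRAMES = {
    "INVOLVED_AGENT": "A {x} is the one who does the {y_gerund}, typically intentionally.",
    "INVOLVED_PATIENT": "A {x} is the one who undergoes the {y_gerund}.",
    "INVOLVED_RESULT": "A {x} comes into existence as a result of {y_gerund}.",
}


def frame_sentence(relation, x, y_gerund):
    """Build the test sentence to present to an informant."""
    return FRAMES[relation].format(x=x, y_gerund=y_gerund)


def encode(relation, x, y_verb):
    """Relation triple to store once the test sentence has been accepted."""
    return ("{to " + y_verb + "}", relation, "{" + x + "}")
```

For example, `frame_sentence("INVOLVED_PATIENT", "learner", "learning")` yields the learner sentence from the slide, and `encode("INVOLVED_PATIENT", "learner", "learn")` yields the corresponding effect.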
Hyponymy overloading
(Guarino 1998, Vossen and Bloksma 1998).
- The vocabulary does not clearly differentiate between orthogonal roles and disjoint types:
  - role: passenger, teacher, student
  - type: dog; cat
  - ?:
    - knife -> weapon, cutlery; spoon -> container, cutlery
    - food
    - material <- building material <-?- stone; <-?- water; <- brick;
- Disjunctive and conjunctive hyperonyms:
  - albino -> animal or plant
  - spoon -> cutlery & container
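One way to keep disjunctive and conjunctive hyperonyms apart is to store the combination mode with the links, so that inference only inherits from hyperonyms that are guaranteed to hold. The representation below is a sketch under that assumption, not an existing wordnet format.

```python
# word -> (mode, hyperonyms); "and" means all hyperonyms hold for every
# instance, "or" means each instance falls under at least one of them.
HYPERONYMS = {
    "spoon": ("and", ["cutlery", "container"]),
    "albino": ("or", ["animal", "plant"]),
}


def entails(word, hyperonym):
    """True only if every instance of word is guaranteed to be a hyperonym."""
    mode, parents = HYPERONYMS.get(word, ("and", []))
    return mode == "and" and hyperonym in parents


def possible(word, hyperonym):
    """True if some instance of word may be a hyperonym."""
    _, parents = HYPERONYMS.get(word, ("and", []))
    return hyperonym in parents
```

A spoon is then both cutlery and a container, whereas an albino is only possibly an animal and possibly a plant.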
Hyponymy restructuring
[Figure: fragment of the Dutch hyponymy hierarchy for diseases, with the following nodes.
ziekte (disease)
ingewandsziekte (bowel disease)
dierenziekte (animal disease)
infectieziekte (infectious disease)
veeziekte (cattle disease)
haringwormziekte (anisakiasis: bowel disease of herrings)
kolder (staggers: brain disease of cattle)
vuilbroed (infectious disease of bees)]
Methodologies in a number of projects
- Princeton Wordnet
- EuroWordNet:
  - English, Dutch, German, French, Spanish, Italian, Czech, Estonian
  - 10,000 up to 50,000 synsets
- BalkaNet:
  - Romanian, Bulgarian, Turkish, Slovenian, Greek, Serbian
  - 10,000 synsets
Main strategies for building wordnets
- Expand approach: translate WordNet synsets to another language and take over the structure
  - easier and more efficient method
  - compatible structure with WordNet
  - vocabulary and structure are close to WordNet but also biased
  - can exploit many resources linked to WordNet: SUMO, WordNet domains, selection restrictions from BNC, etc.
- Merge approach: create an independent wordnet in another language and align it with WordNet by generating the appropriate translations
  - more complex and labor intensive
  - different structure from WordNet
  - language-specific patterns can be maintained, i.e. very precise substitution patterns
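The expand approach can be sketched in a few lines: translate the words of each English synset with a bilingual dictionary and copy the hypernym links between the synsets that received a translation. The synset identifiers, the tiny dictionary and the function name are hypothetical; real projects verify the translations manually.

```python
# English synsets and their hypernym links (identifiers are hypothetical).
EN_SYNSETS = {"en-1": ["building"], "en-2": ["house"], "en-3": ["church"]}
EN_HYPERNYM = {"en-2": "en-1", "en-3": "en-1"}

# Hypothetical English-Dutch bilingual dictionary.
BILINGUAL = {"building": ["gebouw"], "house": ["huis"], "church": ["kerk"]}


def expand(synsets, hypernym, dictionary):
    """Translate synsets and take over the source structure."""
    target = {}
    for synset_id, words in synsets.items():
        translations = [t for word in words for t in dictionary.get(word, [])]
        if translations:  # synsets without a translation remain gaps
            target[synset_id] = translations
    # Keep only links whose two ends both exist in the target wordnet.
    links = {child: parent for child, parent in hypernym.items()
             if child in target and parent in target}
    return target, links


nl_synsets, nl_hypernym = expand(EN_SYNSETS, EN_HYPERNYM, BILINGUAL)
```

The resulting structure is identical to the English one by construction, which is exactly the compatibility benefit and the bias noted above.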
Aligning wordnets
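[Figure: aligning the English and Dutch wordnets.
English wordnet nodes: object; artifact object; natural object; instrument; musical instrument; organ (two senses); hammond organ.
Dutch wordnet nodes: instrument; muziekinstrument; orgel; orgaan; hammond orgel.
Uncertain candidate links are marked "?": orgel ? organ; orgaan ? organ.]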
General criteria for approach:
- Maximize the overlap with wordnets for other languages
- Maximize semantic consistency within and across wordnets
- Maximally focus the manual effort where needed
- Maximally exploit automatic techniques
Top-down methodology
- Develop a core wordnet (5,000 synsets):
  - all the semantic building blocks or foundation to define the relations for all other more specific synsets, e.g. building -> house, church, school
  - provide a formal and explicit semantics
- Validate the core wordnet:
  - does it include the most frequent words?
  - are semantic constraints violated?
- Extend the core wordnet (5,000 synsets or more):
  - automatic techniques for more specific concepts with high-confidence results
  - add other levels of hyponymy
  - add specific domains
  - add ‘easy’ derivational words
  - add ‘easy’ translation equivalence
- Validate the complete wordnet
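The first validation question (does the core include the most frequent words?) can be checked mechanically against a frequency list. The function below is a minimal sketch; the name, the inputs and the cut-off are assumptions, and the frequency list is expected to be sorted from most to least frequent.

```python
def frequency_coverage(core_words, frequency_list, top_n):
    """Share of the top_n most frequent words present in the core wordnet,
    plus the list of frequent words that are still missing."""
    top = frequency_list[:top_n]
    if not top:
        return 0.0, []
    missing = [word for word in top if word not in core_words]
    return 1 - len(missing) / len(top), missing


# Hypothetical data: words in the core wordnet and a corpus frequency ranking.
core = {"house", "go", "make"}
ranking = ["go", "make", "house", "say", "tree"]
coverage, missing = frequency_coverage(core, ranking, top_n=4)
```

The missing words are natural candidates for the next pass of the encoding cycle.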
Developing a core wordnet
- Define a set of concepts (so-called Base Concepts) that play an important role in wordnets:
  - high position in the hierarchy & high connectivity
  - represented as English WordNet synsets
  - Common base concepts: shared by various wordnets in different languages
  - Local base concepts: not shared
- EuroWordNet: 1024 synsets, shared by 2 or more languages
- BalkaNet: 5000 synsets (including 1024)
- Common semantic framework for all Base Concepts, in the form of a Top-Ontology
- Manually translate all Base Concepts (English Wordnet synsets) to synsets in the local languages (was applied for 13 Wordnets)
- Manually build and verify the hypernym relations for the Base Concepts
- All 13 Wordnets are developed from a similar semantic core closely related to the English Wordnet
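The two selection criteria and the notion of common base concepts can be sketched as follows. The thresholds, the relation counts and the function names are illustrative assumptions; the actual selections in EuroWordNet and BalkaNet were made and verified by the project teams.

```python
from collections import Counter


def select_base_concepts(hypernym, relation_counts, max_depth, min_relations):
    """Local selection: concepts high in the hierarchy with many relations."""
    def depth(synset):
        steps = 0
        while synset in hypernym:
            synset = hypernym[synset]
            steps += 1
        return steps

    return sorted(s for s, n in relation_counts.items()
                  if depth(s) <= max_depth and n >= min_relations)


def common_base_concepts(local_selections, min_languages=2):
    """Concepts selected as base concepts in at least min_languages wordnets."""
    counts = Counter(s for selection in local_selections.values()
                     for s in set(selection))
    return sorted(s for s, n in counts.items() if n >= min_languages)


# Hypothetical hierarchy and relation counts.
hierarchy = {"artifact": "object", "tool": "artifact", "spoon": "tool"}
relations = {"object": 40, "artifact": 30, "tool": 12, "spoon": 2}
local = select_base_concepts(hierarchy, relations, max_depth=2, min_relations=10)

shared = common_base_concepts({
    "nl": ["object", "tool"],
    "es": ["object", "artifact"],
    "it": ["object", "tool"],
})
```

Concepts that appear in only one language's selection would remain local base concepts.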
Top-down methodology
[Figure: top-down methodology. The Top-Ontology (63 TCs), the 1024 CBCs and the remaining WordNet1.5 synsets are connected through the Inter-Lingual-Index to two local wordnets. Each local wordnet is built up from hyperonyms, CBC representatives, local BCs, first level hyponyms, remaining hyponyms, and WMs related via non-hyponymy.]
Advantages of the approach
- Well-defined semantics that can be inherited down to more specific concepts
- Apply consistency checks
- Automatic techniques can use semantic basis
- Most frequent concepts and words are covered
- High overlap and compatibility with other wordnets
- Manual effort is focussed on the most difficult concepts and words
Distribution over the top ontology clusters
                       WN                | NL                        | ES                        | IT
Top-Concept            TC-Tokens % of wn | TC-Tokens % of nl % of wn | TC-Tokens % of es % of wn | TC-Tokens % of it % of wn
Animal                 14068     3.99%   | 1193      0.97%   8.5%    | 2458      1.81%   17.5%   | 1122      1.44%   8.0%
Artifact               19562     5.55%   | 10803     8.83%   55.2%   | 9969      7.36%   51.0%   | 6494      8.34%   33.2%
Building               1022      0.29%   | 707       0.58%   69.2%   | 628       0.46%   61.4%   | 434       0.56%   42.5%
Comestible             3377      0.96%   | 1393      1.14%   41.2%   | 1614      1.19%   47.8%   | 624       0.80%   18.5%
Container              1725      0.49%   | 778       0.64%   45.1%   | 799       0.59%   46.3%   | 432       0.55%   25.0%
Covering               2030      0.58%   | 1208      0.99%   59.5%   | 1027      0.76%   50.6%   | 690       0.89%   34.0%
Creature               664       0.19%   | 159       0.13%   23.9%   | 254       0.19%   38.3%   | 27        0.03%   4.1%
Function               34081     9.68%   | 17668     14.44%  51.8%   | 18904     13.96%  55.5%   | 11043     14.18%  32.4%
Furniture              298       0.08%   | 171       0.14%   57.4%   | 147       0.11%   49.3%   | 87        0.11%   29.2%
Garment                756       0.21%   | 494       0.40%   65.3%   | 426       0.31%   56.3%   | 292       0.37%   38.6%
Gas                    93        0.03%   | 67        0.05%   72.0%   | 62        0.05%   66.7%   | 49        0.06%   52.7%
Group                  27805     7.90%   | 3357      2.74%   12.1%   | 3630      2.68%   13.1%   | 2337      3.00%   8.4%
Human                  11543     3.28%   | 6372      5.21%   55.2%   | 7683      5.67%   66.6%   | 4488      5.76%   38.9%
ImageRepresentation    780       0.22%   | 412       0.34%   52.8%   | 426       0.31%   54.6%   | 294       0.38%   37.7%
Instrument             7036      2.00%   | 4102      3.35%   58.3%   | 3590      2.65%   51.0%   | 2564      3.29%   36.4%
LanguageRepresent.     2844      0.81%   | 1273      1.04%   44.8%   | 1218      0.90%   42.8%   | 691       0.89%   24.3%
Liquid                 1629      0.46%   | 617       0.50%   37.9%   | 500       0.37%   30.7%   | 339       0.44%   20.8%
Living                 47104     13.37%  | 10225     8.36%   21.7%   | 13661     10.08%  29.0%   | 7408      9.51%   15.7%
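The "% of wn" figures for Dutch, Spanish and Italian appear to be the token count for a top concept in that wordnet divided by the WordNet token count for the same concept. The short check below reproduces three of the Dutch values from the slide under that reading; the function name is an assumption.

```python
def percent_of_wn(wn_tokens, other_tokens):
    """Token count of another wordnet as a percentage of the WordNet count."""
    return round(100 * other_tokens / wn_tokens, 1)


# (WN tokens, NL tokens) taken from the slide for three top concepts.
rows = {"Animal": (14068, 1193), "Artifact": (19562, 10803), "Gas": (93, 67)}
nl_share = {concept: percent_of_wn(wn, nl) for concept, (wn, nl) in rows.items()}
```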
Wordnet Domains    Concepts  Proportion
acoustics          104       0.092%
administration     2974      2.624%
aeronautic         154       0.136%
agriculture        306       0.270%
alimentation       28        0.025%
anatomy            2705      2.387%
anthropology       896       0.791%
applied_science    28        0.025%
archaeology        68        0.060%
archery            5         0.004%
architecture       255       0.225%
art                420       0.371%
artisanship        148       0.131%
astrology          17        0.015%
astronautics       29        0.026%
astronomy          376       0.332%
athletics          22        0.019%
linguistics        1545      1.363%
literature         686       0.605%
mathematics        575       0.507%
mechanics          532       0.469%
medicine           2690      2.374%
merchant_navy      485       0.428%
meteorology        231       0.204%
metrology          1409      1.243%
military           1490      1.315%
money              624       0.551%
mountaineering     28        0.025%
music              985       0.869%
mythology          314       0.277%
number             220       0.194%
numismatics        43        0.038%
occultism          52        0.046%
oceanography       10        0.009%
EWN Interlingual Relations
• EQ_SYNONYM: there is a direct match between a synset and an ILI-record.
• EQ_NEAR_SYNONYM: a synset matches multiple ILI-records simultaneously.
• HAS_EQ_HYPERONYM: a synset is more specific than any available ILI-record.
• HAS_EQ_HYPONYM: a synset can only be linked to more specific ILI-records.
• other relations: CAUSES/IS_CAUSED_BY, EQ_SUBEVENT/EQ_ROLE, EQ_IS_STATE_OF/EQ_BE_IN_STATE
Multilinguality
Complex equivalence relations
eq_near_synonym
1. Multiple Targets
One sense for Dutch schoonmaken (to clean) which simultaneously matches with at least 4 senses of clean in WordNet1.5:
• {make clean by removing dirt, filth, or unwanted substances from}
• {remove unwanted substances from, such as feathers or pits, as of chickens or fruit}
• {remove in making clean; "Clean the spots off the rug"}
• {remove unwanted substances from - (as in chemistry)}
The Dutch synset schoonmaken will thus be linked with an eq_near_synonym relation to all these senses of clean.
2. Multiple Source meanings
Synsets inter-linked by a near_synonym relation can be linked to the same target ILI-record(s), either with an eq_synonym or an eq_near_synonym relation:
Dutch wordnet: toestel near_synonym apparaat
ILI-records: {machine}; {device}; {apparatus}; {tool}
Complex equivalence relations
has_eq_hyperonym
Typically used for gaps in WordNet1.5 or in English:
• genuine, cultural gaps for things not known in English culture, e.g. citroenjenever, which is a kind of gin made out of lemon skin,
• pragmatic, in the sense that the concept is known but is not expressed by a single lexicalized form in English, e.g.: Dutch hoofd only refers to human head and Dutch kop only refers to animal head; English uses head for both.
has_eq_hyponym
Used when WordNet1.5 only provides narrower terms. In this case there can only be a pragmatic difference, not a genuine cultural gap, e.g.: Spanish dedo can be used to refer to both finger and toe.
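The choice between these equivalence relations can be summarized as a small decision procedure. The sketch below assumes an encoder has already judged which ILI-records match directly and which are only broader or narrower; the function name, argument names and sense labels such as "clean#1" are hypothetical.

```python
def equivalence_links(synset, exact=(), broader=(), narrower=()):
    """Return (synset, relation, ILI-record) triples.

    exact:    ILI-records judged to match the synset directly
    broader:  closest less specific records, when no direct match exists
    narrower: more specific records, when only those are available
    """
    if len(exact) == 1:
        return [(synset, "eq_synonym", exact[0])]
    if len(exact) > 1:
        return [(synset, "eq_near_synonym", record) for record in exact]
    if broader:
        return [(synset, "has_eq_hyperonym", record) for record in broader]
    return [(synset, "has_eq_hyponym", record) for record in narrower]


gin_links = equivalence_links("citroenjenever", broader=["gin"])
dedo_links = equivalence_links("dedo", narrower=["finger", "toe"])
clean_links = equivalence_links(
    "schoonmaken", exact=["clean#1", "clean#2", "clean#3", "clean#4"])
```

The three calls mirror the citroenjenever, dedo and schoonmaken examples from the slides.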
Overview of equivalence relations to the ILI
Relation            POS   Sources : Targets    Example
eq_synonym          same  1 : 1                auto : voiture -> car
eq_near_synonym     any   many : many          apparaat, machine, toestel : apparatus, machine, device
eq_hyperonym        same  many : 1 (usually)   citroenjenever : gin
eq_hyponym          same  (usually) 1 : many   dedo : toe, finger
eq_metonymy         same  many/1 : 1           universiteit, universiteitsgebouw : university
eq_diathesis        same  many/1 : 1           raken (cause), raken : hit
eq_generalization   same  many/1 : 1           schoonmaken : clean
Filling gaps in the ILI
Types of GAPS
1. genuine, cultural gaps for things not known in English culture, e.g. citroenjenever, which is a kind of gin made out of lemon skin,
  • Non-productive
  • Non-compositional
2. pragmatic, in the sense that the concept is known but is not expressed by a single lexicalized form in English, e.g.: container, borrower, cajera (female cashier)
  • Productive
  • Compositional
3. Universality of gaps: Concepts occurring in at least 2 languages
Productive and Predictable Lexicalizations exhaustively linked to the ILI
[Figure: language-specific synsets linked to several ILI-records at once.
Dutch verbs {doodslaan V}, {doodstampen V}, {doodschoppen V} and German verbs {totschlagen V}, {tottrampeln V} are linked by hypernym relations to the ILI-records kill, beat, stamp and kick.
Dutch {casière} and Spanish {cajera N} are linked to cashier (hypernym) and female (in_state).
Spanish {alevín N} is linked to fish (hypernym) and young (in_state).]
Top-down methodology
[Figure: top-down methodology applied to an Arabic wordnet. Shared resources: Sumo Ontology (hypernyms), EuroWordNet and BalkaNet Base Concepts (SBC, CBC, ABC; 1000 and 5000 synsets), WordNet Domains, Named Entities. The English Wordnet and the Arabic Wordnet are connected through WordNet synsets and an English-Arabic lexicon, e.g. 1045678-v {teach} and 1045678-v {darrasa}. The core wordnet (5000 synsets) is extended with next level hyponyms, more hyponyms, easy translations, domain terms (e.g. “chemics”) and named entities. Arabic-specific inputs: Arabic word frequency, Arabic roots & derivation rules.]
Resources
- Monolingual dictionaries:
  - definitions
  - synonym relations
  - other relations
- Bi-lingual dictionaries: L-English, English-L
- Ontologies
- Thesauri
- Corpora:
  - monolingual
  - parallel