Building of the Polish Wordnet The First Steps of The

plWordNet as the Cornerstone of a Toolkit of Lexico-semantic Resources
Marek Maziarz, Maciej Piasecki,
Ewa Rudnicka, Stanisław Szpakowicz*
G4.19 Research Group, Institute of Informatics
Wrocław University of Technology
* School of Electrical Engineering and Computer Science
University of Ottawa
www.plwordnet.pwr.wroc.pl
Wordnet as a Lexical Resource
• Princeton WordNet defines the de facto standard
– large size and coverage
– open access
– thousands of applications
• Applications:
dictionary vs knowledge representation
• Range of description
• Ideal size and natural development limits
plWordNet model: linguistic resource
• Wordnet vs ontology
– O: a strict knowledge representation
– W: concepts expressed entirely in a natural language
– W: synonymy is a matter of degree
– O: certainty and a rigorous construction
– W: shaped by the lexico-semantic dependencies
• Alternative to formalisation
– Corpus analysis and substitution tests
– Minimal commitment: defining lexico-semantic
relations without committing to any particular theory
of lexical semantics or human cognition
plWordNet model:
corpus-based development
• Main source of lexical knowledge: a very
large monolingual corpus
– tools for corpus browsing
– semi-automatic knowledge extraction
• Additional sources: dictionaries and
encyclopedias
• Lexical unit
– lemma-sense pair
– a linguistically motivated primitive
plWordNet model: synset definition
• Synsets
– groups of lexical units sharing certain relations
{afekt 1 `passion’, uczucie 2 `feeling’} hypernym
{miłość 1 `love’, umiłowanie 1 `affection’, kochanie 1 ~`loving’}
• Constitutive relations
– fairly frequent (to describe many LUs)
– shared among LUs (to define groups)
– grounded in the linguistic tradition (to facilitate their consistent
understanding)
– used in other wordnets (to improve compatibility)
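The synset definition above (lexical units grouped because they share constitutive relations) can be illustrated with a minimal Python sketch. The class `LexicalUnit`, the function `group_into_synsets`, and the target labels `S_feeling` and `S_love` are hypothetical names chosen for illustration; the relation data is a toy stand-in, not plWordNet content, and this is not the project's actual procedure.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class LexicalUnit:
    """A lemma-sense pair, the primitive named on the slide."""
    lemma: str
    sense: int


def group_into_synsets(relations):
    """Group lexical units whose sets of constitutive relations are identical.

    `relations` maps each LexicalUnit to a set of (relation_name, target) pairs.
    """
    groups = defaultdict(list)
    for unit, rels in relations.items():
        groups[frozenset(rels)].append(unit)
    return list(groups.values())


# Toy relation data (assumption): targets are opaque labels.
toy_relations = {
    LexicalUnit("miłość", 1): {("hypernym", "S_feeling")},
    LexicalUnit("umiłowanie", 1): {("hypernym", "S_feeling")},
    LexicalUnit("kochanie", 1): {("hypernym", "S_feeling")},
    LexicalUnit("afekt", 1): {("hyponym", "S_love")},
    LexicalUnit("uczucie", 2): {("hyponym", "S_love")},
}

synsets = group_into_synsets(toy_relations)
for synset in synsets:
    print([f"{u.lemma} {u.sense}" for u in synset])
```

In this toy run the three units that share the same hypernym land in one group and the two units that share the same hyponym land in another, mirroring the example on the slide.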
plWordNet model:
non-relational aspects
• Constitutive features
– stylistic registers,
– verb aspect
– and semantic verb classes
• Referred to in the relation definitions
– e.g. relations limited to verbs of the same
aspect and semantic class
• Glosses help wordnet editors
• Usage examples: direct links to the corpus
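The idea that a relation definition can refer to constitutive features (for example, a relation limited to verbs of the same aspect and semantic class) might be sketched as a simple guard. The names `VerbUnit` and `may_link` and the class label `motion` are assumptions made for this illustration; plWordNet's real feature inventory is not given on the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VerbUnit:
    lemma: str
    sense: int
    aspect: str          # "perfective" or "imperfective"
    semantic_class: str  # hypothetical class label


def may_link(source: VerbUnit, target: VerbUnit) -> bool:
    """Allow a relation only between verbs of the same aspect and semantic class."""
    return (source.aspect == target.aspect
            and source.semantic_class == target.semantic_class)


# Toy verb entries (the class label is an assumption).
biec = VerbUnit("biec", 1, "imperfective", "motion")
isc = VerbUnit("iść", 1, "imperfective", "motion")
pobiec = VerbUnit("pobiec", 1, "perfective", "motion")

print(may_link(biec, isc))    # same aspect and class
print(may_link(pobiec, isc))  # aspect differs
```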
Relation density
• Synset relation density in PWN 3.1 and in plWordNet 2.0
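[Figure: bar chart of relations per synset, PWN vs plWN, for nouns, verbs, adjectives and total; vertical axis 0 to 4.5. Bar labels in extraction order, series assignment not recoverable: 3.99, 3.51, 3.54, 3.11, 3.06, 2.21, 2.43, 1.56]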
Size matters: lexical coverage
Coverage of PWN/plWN for lemmas of different frequency
in two similar 1.2G-word corpora (Wikipedia)
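[Figure: bar chart of coverage (%), plWN vs PWN, by corpus frequency of lemmas: ≥1000, ≥500, ≥200, ≥100, ≥50; vertical axis 0% to 70%. Bar labels in extraction order, series assignment not recoverable: 45.6%, 58.3%, 38.3%, 35.0%, 28.0%, 27.7%, 17.0%, 21.0%, 10.7%, 6.4%]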
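One plausible reading of "coverage for lemmas of different frequency" is the share of corpus lemmas occurring at least t times that are also present in the wordnet. The sketch below follows that reading, which is an assumption, since the slide does not define the measure. The function name `coverage_by_threshold` and all the data are illustrative.

```python
from collections import Counter


def coverage_by_threshold(lemma_counts, wordnet_lemmas, thresholds):
    """For each threshold t, return the fraction of corpus lemmas with
    frequency >= t that are found in the wordnet (None if no such lemmas)."""
    result = {}
    for t in thresholds:
        frequent = [lemma for lemma, count in lemma_counts.items() if count >= t]
        if not frequent:
            result[t] = None
            continue
        covered = sum(1 for lemma in frequent if lemma in wordnet_lemmas)
        result[t] = covered / len(frequent)
    return result


# Toy data (assumption): lemma frequencies and a tiny wordnet lemma list.
toy_counts = Counter({"l1": 1200, "l2": 600, "l3": 250, "l4": 120, "l5": 60})
toy_wordnet = {"l1", "l2", "l3"}

coverage = coverage_by_threshold(toy_counts, toy_wordnet, [1000, 500, 200, 100, 50])
for threshold, share in coverage.items():
    print(f">={threshold}: {share:.0%}")
```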
Size matters: plWordNet 2.2
POS          Synsets    Lemmas     LUs        Average synset size
Nouns        102 613    105 883    140 701    1.37
Verbs         21 897     17 554     32 180    1.47
Adjectives    15 145     11 677     18 787    1.24
All          139 656    135 115    191 669    1.37
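The last column of the plWordNet 2.2 figures appears to be the number of lexical units divided by the number of synsets. That reading is an assumption, but the quick check below reproduces the reported values from the counts on the slide.

```python
# (synsets, lexical units) per part of speech, copied from the slide.
counts = {
    "Nouns": (102_613, 140_701),
    "Verbs": (21_897, 32_180),
    "Adjectives": (15_145, 18_787),
    "All": (139_656, 191_669),
}

# Average synset size = lexical units per synset, rounded to two decimals.
average_size = {
    pos: round(lus / synsets, 2) for pos, (synsets, lus) in counts.items()
}

for pos, size in average_size.items():
    print(f"{pos}: {size:.2f}")
```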
www.plwordnet.pwr.wroc.pl
plWordNet: ongoing work
Size matters: comparison of wordnets
[Figure: bar chart comparing the numbers of synsets, lemmas and lexical units in plWN 2.2, PWN 3.1 and GermaNet; vertical axis 0 to 250 000]
How many words are there?
- existing dictionaries
● Woordenboek der Nederlandsche Taal
● dictionary of the Brothers Grimm: 330k lemmas
● Oxford English Dictionary: 300k lemmas
● `Warsaw’ Polish Dictionary: 280k lemmas
● contemporary Polish dictionaries: 130k lemmas
unabridged dictionaries
430k lemmas
How many words are there?
- approximation
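[Figure: number of lemmas occurring 10+ times, N10+ (x 10^3 words, axis 0 to 200), plotted against corpus size (x 10^6 words, axis 0 to 1800). COBUILD data points: Cobuild (1986), Cobuild Bank of English (1993), Bank of English (2001). The plWordNet corpus point is labelled ~174k (10+ lemmas). Fitted-curve label as extracted: "N10+ = 6,67 corpus size"; the operator or exponent applied to corpus size did not survive extraction]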
How many words are there?
Source                                       # entries
Polish dictionaries                          100-280k
plWordNet corpus (10+ lemmas) [K]            174k
doubled plWordNet corpus (0+ lemmas) [GT]    +200k
plWordNet 3.0                                200k lemmas
K - Krishnamurthy’s data (2002), GT - Good & Toulmin approximation (1956)
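The Good & Toulmin approximation cited above estimates how many previously unseen types would appear if the sample were enlarged. In its standard form the estimate is an alternating sum over the frequency spectrum: with n_k the number of types seen exactly k times and t the ratio of additional to original sample size, the expected number of new types is the sum over k of (-1)^(k+1) · t^k · n_k. For a doubled corpus (t = 1) this reduces to n_1 − n_2 + n_3 − … The sketch below applies that textbook formula to toy counts; how the authors computed their figure is not stated on the slide, and the function name is illustrative.

```python
from collections import Counter


def good_toulmin_new_types(lemma_counts, t=1.0):
    """Estimate the number of unseen types expected when the sample grows
    by a factor t relative to its current size (t=1 means doubling)."""
    # Frequency spectrum: n_k = number of types observed exactly k times.
    spectrum = Counter(lemma_counts.values())
    return sum((-1) ** (k + 1) * (t ** k) * n_k for k, n_k in spectrum.items())


# Toy lemma frequencies (assumption): three hapaxes, one type each at 2, 3 and 5.
toy_counts = {"a": 1, "b": 1, "c": 1, "d": 2, "e": 3, "f": 5}

estimate = good_toulmin_new_types(toy_counts, t=1.0)
print(estimate)
```

The alternating series is known to become unstable for t well above 1, so the sketch is only meaningful for modest extrapolation such as doubling.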
Toolkit of Lexico-semantic Resources
• Lexicon of lexico-syntactic structures of
multi-word expressions
• plWordNet 3.0 (Słowosieć 3.0)
• plWordNet 3.0 to WordNet 3.1 mapping
• Semantic lexicon of proper names
• Mapping to an ontology
• And a valency lexicon linked to plWordNet
Lexicon of multi-word expressions
• Non-trivial morphology of Polish MWEs
– more than 100 nominal structural patterns
• Description of the lexico-syntactic
structures of MWEs
• Multi-word LUs as semantic atoms
– no internal semantic relations
• Dynamic lexicon
– a tool for automatic MWE extraction
– 60 000 described in the lexicon and plWordNet
Lexicon of Proper Names
• PNs are not a part of the lexicon
• PN is an instance of a type
– characterised by referents
– not by their semantic properties
• Linking PNs via a wordnet
– some lexico-syntactic contexts signal the `instance of’ relation
– PNs are represented in wordnets
• PNs as derivational bases for Common Nouns
• Dynamic lexicon with 2.5 million PNs verified
manually
plWordNet to WordNet 3.1 mapping
• plWordNet: built independently to obtain
a faithful description
• Manual mapping
– bottom-up order
– comparison of the relation structures
– a cascading list of inter-lingual relations
• plWordNet verification as an important side
effect
• Present state: 72 000 N and Adj synsets mapped
• Target: complete plWordNet 3.0 mapped
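A "cascading list" of inter-lingual relations can be read as an ordered fallback: try the strictest relation first and move to weaker ones only when it does not apply. The sketch below illustrates that reading. The relation names `I-synonymy` and `I-hyponymy`, the synset identifiers, and the judgement sets are hypothetical placeholders standing in for an editor's manual decisions; the slide does not list the actual relations or their order.

```python
from typing import Callable, Optional, Sequence, Tuple

RelationTest = Callable[[str, str], bool]


def first_applicable(source: str, target: str,
                     cascade: Sequence[Tuple[str, RelationTest]]) -> Optional[str]:
    """Return the name of the first relation in the cascade whose test holds."""
    for name, holds in cascade:
        if holds(source, target):
            return name
    return None


# Toy editor judgements (assumption): pairs of (plWordNet synset, WordNet synset).
judged_synonyms = {("pl:A", "en:X")}
judged_hyponyms = {("pl:B", "en:X")}

# Hypothetical cascade, strictest relation first.
cascade = [
    ("I-synonymy", lambda s, t: (s, t) in judged_synonyms),
    ("I-hyponymy", lambda s, t: (s, t) in judged_hyponyms),
]

print(first_applicable("pl:A", "en:X", cascade))
print(first_applicable("pl:B", "en:X", cascade))
print(first_applicable("pl:C", "en:X", cascade))
```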
Wordnet editor: WordnetLoom
WordnetLoom: editing the mapping
Mapping to ontology
• Ontology: unambiguous concepts defined
formally
• Lexical meanings
– imprecisely delimited
– constrained by usage, stylistic register and sentiment
• Mapping to ontology
– precise, formal description for meanings
– association: concepts – their lexical embodiment
• SUMO selected
– Princeton WordNet mapping
– Semi-automated mapping of plWordNet
Expectations
[Diagram: toolkit components and their links. Boxes: Valence lexicon; MWE lexicon; plWordNet 3.0; WordNet 3.1 + extension; Proper Names; Ontology: SUMO + intermediate level. Edge label: describes]
Applications
• Strong universal basis
– a comprehensive wordnet >200 000 lemmas
resulting in ~285 000 LUs and ~210 000 synsets
– one of the largest ever Polish dictionaries
• Modularly constructed toolkit
– a layered architecture of large software systems
– separate but linked layers
– each layer based on a limited set of notions and
principles, and exchangeable
• The core of the CLARIN-PL language technology
infrastructure
www.plwordnet.pwr.wroc.pl
Thank you!