Dictionaries and thesauri

Dictionaries
See Patrick Hanks, “Lexicography”, chapter 3 of Mitkov, R. (ed.),
The Oxford Handbook of Computational Linguistics, Oxford: OUP, 2004.
Dictionaries/Lexicons
• Lexicography and the computer
• Corpus-based lexicography
• MRDs
• Dictionaries for NLP
• Thesauri: structured lexicons
Computational lexicography
• Restructuring and exploiting human dictionaries
for use by computer programs
• Using computational techniques to compile
(new) dictionaries
• Focus on English (and other well established
languages)
• Significantly different issues for other languages,
especially
– Alphabetization and arrangement
– Compilation from scratch for previously unstudied
languages
Human dictionaries
• Traditional view of what a “dictionary” is
– List of words, arranged (usually) alphabetically
– Inclusion in dictionary lends authority, even proscriptively
– Entry typically gives
• spelling ... alternate spellings
• POS, morphology (if irregular)
• core definition (using defining vocab?)
• pronunciation (using own transcription)
• etymology
• examples of usage
– as justification for inclusion
– as illustration of use (esp. learner’s dictionaries)
– Entry typically doesn’t give
• help with spelling
• morphology (if regular), especially derivational
• subcategorization information
• contrastive examples of use
• indications of possible metaphorical extensions to meaning
Human dictionaries
• Historically
– bilingual dictionaries for translators
– monolingual dictionary as (pre/proscriptive) definition
of language, often polemical
– OED (1884-1928) first dictionary on purely descriptive
principle, relying on citations
• Deficiencies and difficulties
– What to include? (neologisms, slang)
– Inclusion of names
– Differentiating senses
Differentiating word senses
• Dictionaries disagree widely
• Probably no right answer
• General principles (look for excuse to split vs
look for reason to lump)
• Keep related words of different POS together?
• Etymology can be misleading (eg crane, pupil)
• Metaphorical extension of original meaning –
how far do you go? (eg rose, bar)
• Purpose of dictionary may help decide, eg
translation
Citations
• Senses and uses identified by collecting
examples of use
– Sent in on “slips” by informants
– Lexicographer’s job is to collate these
• Criteria for a new word (or new meaning)
– Number of citations
– Source of citations
– Veracity of use
Corpus-based dictionaries
• Corpus: a collection of texts, usually collected with
a specific purpose in mind
• British National Corpus, attempt to capture
a synchronic picture of BrE of the late
1980s (100m words)
• COBUILD “Bank of English” dynamic
“monitor” corpus used to help
lexicographers identify/define usage
Machine-readable dictionaries
• “Machine” means “computer”
• Dictionary stored in a format which makes
it manipulable on a computer
• Originally, derived from MR version of print
dictionary (from type-setter’s tapes)
• Now the other way round: data stored as a
database from which hard copy can be
printed (inter alia)
MRDs - advantages
• Flexibility of access and presentation
– Not bound to alphabetical listing
– Information presented can be filtered
– Can be searched as a database
– Different versions (for different users, serving different
purposes) can be produced
• Increased storage capacity
– More information can be stored, especially
• Implicit information can be made explicit
• More examples, including “negative data”
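The access advantages listed above can be illustrated with a small sketch. The `Entry` fields, the three sample entries and the function names are all invented for illustration; they do not reflect the format of any real MRD.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    headword: str
    pos: str
    definition: str
    examples: list = field(default_factory=list)


# Hypothetical mini-lexicon; a real MRD holds far richer records.
LEXICON = [
    Entry("crane", "noun", "a large long-necked wading bird",
          ["A crane stood in the marsh."]),
    Entry("crane", "noun", "a machine for lifting heavy loads"),
    Entry("reckon", "verb", "to think or suppose",
          ["I reckon it will rain."]),
]


def search(lexicon, **criteria):
    """Database-style search: entries whose fields equal all criteria."""
    return [e for e in lexicon
            if all(getattr(e, k) == v for k, v in criteria.items())]


def learner_view(entry):
    """Filtered presentation: headword, POS and first example only."""
    example = entry.examples[0] if entry.examples else ""
    return f"{entry.headword} ({entry.pos}) {example}".strip()


def defined_with(lexicon, word):
    """Non-alphabetical access: headwords whose definition uses `word`."""
    return [e.headword for e in lexicon if word in e.definition.split()]
```

The same stored records serve a search by part of speech, a cut-down learner's view and a lookup by defining word, none of which depends on alphabetical order.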
Lexicons for NLP
• Have to state everything we need to know about the
word
– Phonology: stress pattern, possible weak forms
– Orthography: spelling alternatives, hyphenation
– Morphology: inflectional paradigms, even if regular
– Information about derivations
– Syntax: Explicit information about subcategorization and
• eg syntactic/semantic features of arguments
• Any special interpretation of tenses
– Lexical combinatorics: compounds, idioms
– Semantics: definition, semantic features, semantic relations
– Pragmatics: register, collocation, connotation
Lexicons for NLP - example
• Information about derivations
• Agentive derivation (-er) is very productive
– Usually means the actor doing the action of a verb,
e.g. swimmer, dancer, killer
– Not available for some verbs, e.g. *knower, *cycler,
*sayer (though cf. soothsayer), *hoper
– May have a specialised meaning instead of or as well
as the derived meaning, e.g. revolver, computer,
washer, hitter
– In some cases can mean the object undergoing the
action (via ergative use of verb), e.g. taster
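One way an NLP lexicon might record this is a productive rule plus lexically listed exceptions. The following is a minimal sketch under that assumption: the set names and functions are hypothetical, the word lists come only from the examples above, and the spelling rule is deliberately naive (it would wrongly double the consonant in a verb like "open").

```python
# Hypothetical lexical data, taken from the examples above.
BLOCKED = {"know", "cycle", "say", "hope"}        # *knower, *cycler, ...
SPECIALISED = {"revolver", "computer", "washer", "hitter"}
PATIENT_READING = {"taster"}                      # thing undergoing action

VOWELS = "aeiou"


def agentive(verb):
    """Spell the -er form with simplified English orthography rules."""
    if verb.endswith("e"):
        return verb + "r"                         # dance -> dancer
    # Naive consonant doubling after a single vowel: swim -> swimmer.
    if (len(verb) >= 3 and verb[-1] not in VOWELS + "wxy"
            and verb[-2] in VOWELS and verb[-3] not in VOWELS):
        return verb + verb[-1] + "er"
    return verb + "er"                            # kill -> killer


def derive_er(verb):
    """Return (form, notes), or None if the derivation is blocked."""
    if verb in BLOCKED:
        return None
    form = agentive(verb)
    notes = ["agent of " + verb]                  # default, regular reading
    if form in SPECIALISED:
        notes.append("may have a specialised meaning")
    if form in PATIENT_READING:
        notes.append("can denote the thing undergoing the action")
    return form, notes
```

Only the exceptions need to be stored per word; the regular reading falls out of the rule.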
Subcategorization
• Words are assigned to categories (ie parts
of speech, POS), eg noun, verb
– on basis of form, meaning, use
• Syntactic behaviour is predictable from (or
determined by) category
• Within a category there are subcategories
with specific patterns of behaviour, both
syntactic and semantic, e.g.
– transitive/intransitive verb → direct object?
passivize?
Subcategorization
• Subcat frames indicate complement patterns
and preferences, e.g.
– subj, obj, double obj, prep-obj, infinitival complement,
that complement etc
– semantic features of complements, eg obj of eat
normally edible
• Subcat information can help to disambiguate
– cf He told [the man] [where the body was buried].
– He found [the place [where the body was buried]].
• Much of this info can be captured in general
rules
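The use of subcat frames for disambiguation can be sketched as a lookup table. The frame labels loosely follow the list above, but the table itself, the label `wh-comp` and the function names are assumptions made for illustration. The sketch simply prefers the complement reading whenever the verb licenses one, which is a simplification.

```python
# Hypothetical frame inventory for a handful of verbs.
SUBCAT = {
    "sleep": [("subj",)],
    "eat": [("subj",), ("subj", "obj")],
    "give": [("subj", "obj", "obj2"), ("subj", "obj", "prep-obj")],
    "tell": [("subj", "obj", "wh-comp"), ("subj", "obj", "that-comp")],
    "find": [("subj", "obj")],
}

# Semantic preference on a complement: obj of eat is normally edible.
PREFERENCES = {("eat", "obj"): "edible"}


def licenses(verb, frame):
    """True if the lexicon lists this complement pattern for the verb."""
    return tuple(frame) in SUBCAT.get(verb, [])


def attach_wh_clause(verb):
    """Decide where a 'where ...' clause after the object NP attaches."""
    if licenses(verb, ("subj", "obj", "wh-comp")):
        return "complement of verb"          # told [the man] [where ...]
    return "relative clause inside object NP"  # found [the place [where ...]]
```

For example, `attach_wh_clause("tell")` chooses the verb-complement bracketing, while `attach_wh_clause("find")` falls back to the relative-clause bracketing.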
• Have to state everything we need to know
about the word, though not necessarily
explicitly
– There can be rules to capture inheritance of
properties, e.g.
• accomplishment + prog tense implies incompletion
• cf She was baking a cake when she dropped dead → no cake
• She was stroking the cat when she dropped dead
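A rule of this kind can be stated once for a whole class of predicates and inherited by every member, rather than listed word by word. The sketch below assumes a hypothetical table of aspectual classes for the two examples above; the class labels and function name are illustrative only.

```python
# Hypothetical aspectual classes for the two predicates above.
ASPECT_CLASS = {
    "bake a cake": "accomplishment",
    "stroke the cat": "activity",
}


def entails_completion(predicate, progressive):
    """Class-level rule: a progressive accomplishment is not completed."""
    if ASPECT_CLASS[predicate] == "accomplishment" and progressive:
        return False      # was baking a cake: no cake need result
    return True           # was stroking the cat: the cat was stroked
```

The lexicon then only needs to record the class of each predicate; the inference about incompletion follows from the rule.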
Exploiting human dictionaries in NLP
• In all NLP applications, lexicon is major bottleneck
• Availability of MRD versions of human dictionaries
provided possible solution
– Obviously, MRD gives list of words, and some information
– Extract further information about verb frames by analysing the
examples
– Identify semantic features from definitions
eg a plant which..., a person who...
– Identify hidden arguments
eg to lock = to close sthg using a key
cf He locked the door. The key was heavy.
He emptied his pockets. *The key was heavy.
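The last two ideas can be sketched with simple pattern matching over definition text. This is only an illustration: the regular expressions, the feature labels and the sample definitions in the usage note are assumptions, and real dictionary definitions are far less uniform.

```python
import re

# Hypothetical mapping from genus words to semantic features.
FEATURES = {"plant": "+plant", "person": "+human"}

# "a plant which ...", "a person who ...": article, genus word, relativiser.
GENUS = re.compile(r"^(?:an?|the)\s+(\w+)\s+(?:which|who|that)\b",
                   re.IGNORECASE)

# "... using a key": an instrument mentioned only in the definition.
INSTRUMENT = re.compile(r"\busing\s+(?:an?|the)\s+(\w+)", re.IGNORECASE)


def semantic_feature(definition):
    """Return a feature for the genus word of a definition, if known."""
    match = GENUS.match(definition.strip())
    if match is None:
        return None
    return FEATURES.get(match.group(1).lower())


def hidden_instrument(definition):
    """Return the instrument named in a 'using a ...' phrase, if any."""
    match = INSTRUMENT.search(definition)
    return match.group(1).lower() if match else None
```

With these, `semantic_feature("a person who teaches")` gives `"+human"`, and `hidden_instrument("to close sthg using a key")` gives `"key"`, which is the argument that makes "The key was heavy" coherent after "He locked the door".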
Exploiting human dictionaries in NLP
• Generic information about a word and its
usage can be derived from definitions in
which it occurs:
Wine: alcoholic drink made from fermented juices, especially of grapes
Vintage: a season’s yield of wine from a vineyard
Red wine: wine having a red colour derived from the skins of the grapes used ...
Vineyard: an orchard where grapes are grown for the purpose of wine making
Pinot noir: a dry red Californian table wine
Sake: Japanese rice wine
Claret: a dry red Bordeaux or Bordeaux-like wine
Sherry: a sweet white wine from the Jerez region of Spain
Riesling: a dessert wine made from white grapes grown historically in Germany ...
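A minimal sketch of this idea is an inverted index from a word to the entries whose definitions mention it. The definitions below are copied from the list above (truncations kept as "..."); the function names and the tokenisation are assumptions made for illustration.

```python
import re
from collections import Counter

DEFINITIONS = {
    "Wine": "alcoholic drink made from fermented juices, especially of grapes",
    "Vintage": "a season’s yield of wine from a vineyard",
    "Red wine": "wine having a red colour derived from the skins of the grapes used ...",
    "Vineyard": "an orchard where grapes are grown for the purpose of wine making",
    "Pinot noir": "a dry red Californian table wine",
    "Sake": "Japanese rice wine",
    "Claret": "a dry red Bordeaux or Bordeaux-like wine",
    "Sherry": "a sweet white wine from the Jerez region of Spain",
    "Riesling": "a dessert wine made from white grapes grown historically in Germany ...",
}


def tokens(text):
    """Lower-cased alphabetic tokens of a definition."""
    return re.findall(r"[a-z]+", text.lower())


def mentions(word):
    """Headwords whose definition contains `word`."""
    return [hw for hw, d in DEFINITIONS.items() if word in tokens(d)]


def cooccurring(word):
    """Count, per definition mentioning `word`, the other words used."""
    counts = Counter()
    for hw in mentions(word):
        counts.update(set(tokens(DEFINITIONS[hw])) - {word})
    return counts
```

Here `mentions("wine")` returns the eight entries other than Wine itself, and `cooccurring("wine")` shows which properties (red, dry, white, grapes) recur across those definitions.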
Corpus-based lexicography revisited
• Similarly, analysis of real examples can
reveal patterns of usage
– Identify primary meaning: not always what
you’d expect (example of reckon)
– Identify possible complementation patterns,
and their relative frequency
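As a rough sketch of how relative frequencies of patterns might be read off a corpus, the code below counts the word that immediately follows a verb. The four sentences are invented stand-ins for corpus lines, not real corpus data, and looking at only the next word is a crude proxy for a complementation pattern.

```python
from collections import Counter

# Invented sentences standing in for concordance lines.
CORPUS = [
    "I reckon that it will rain",
    "They reckon on a good harvest",
    "We reckon that prices will fall",
    "She reckoned the cost carefully",
]


def following_word_counts(lines, stem):
    """Count the word directly after any token starting with `stem`."""
    counts = Counter()
    for line in lines:
        words = line.lower().split()
        for i, word in enumerate(words[:-1]):
            if word.startswith(stem):
                counts[words[i + 1]] += 1
    return counts


def relative_frequencies(counts):
    """Turn raw counts into proportions of all observed uses."""
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}
```

On this toy input, `following_word_counts(CORPUS, "reckon")` separates the "that" pattern from the "on" pattern and the direct-object pattern; a lexicographer would do the same over many thousands of real lines.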
Structured dictionaries
• Special type of dictionary in which words
are grouped together according to their
meaning: thesaurus
• Classic example Roget’s Thesaurus
(1852)
• Structured vocabulary much used in field
of terminology
• Also now a valuable resource for NLP:
Miller’s (Princeton) WordNet (1985)