
Thesauruses for Natural
Language Processing
Adam Kilgarriff
Lexicography MasterClass
and
University of Brighton
Outline
Definition
 Uses for NLP
 WASPS thesaurus
 web thesauruses
 Argument: words not word senses
 Evaluation proposals
 Cyborgs

What is a thesaurus?
a resource that groups words
according to similarity
Manual and automatic

Manual
– Roget, WordNets, many publishers
Automatic
– Sparck Jones (1960s), Grefenstette (1994), Lin
(1998), Lee (1999)
– aka distributional
– two words are similar if they occur in same
contexts

Are they comparable?
Thesauruses in NLP

sparse data

does x go with y?
– don’t know, they have never been seen together

New question:
does x+friends go with y+friends
– indirect evidence for x and y
– thesaurus tells us who friends are
– “backing off”
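
The backing-off step can be sketched in a few lines of Python. This is a minimal illustration only: the pair counts and the `FRIENDS` thesaurus are invented toy data, and `goes_with` is a hypothetical helper, not part of any tool named in this talk.

```python
from collections import Counter

# Invented toy data: co-occurrence counts from some corpus, and a thesaurus
# mapping each word to its nearest neighbours ("friends").
PAIR_COUNTS = Counter({("crocodile", "river"): 4, ("caiman", "stream"): 2})
FRIENDS = {
    "alligator": ["crocodile", "caiman"],
    "upwaters": ["river", "stream"],
    "allegory": ["metaphor", "parable"],
}

def goes_with(x, y, counts=PAIR_COUNTS, friends=FRIENDS):
    """Return (evidence, source): the direct count if any, else a backed-off count."""
    direct = counts[(x, y)]
    if direct > 0:
        return direct, "direct"
    # No direct evidence: ask whether x+friends has been seen with y+friends.
    xs = [x] + friends.get(x, [])
    ys = [y] + friends.get(y, [])
    backed_off = sum(counts[(a, b)] for a in xs for b in ys)
    return backed_off, "backed-off"
```

With this toy data, `goes_with("alligator", "upwaters")` finds indirect evidence through the friends, while `goes_with("allegory", "upwaters")` finds none.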
Relevant in:
Parsing
– PP-attachment
– conjunction scope
Bridging anaphors
Text cohesion
Word sense disambiguation (WSD)
Speech understanding
Spelling correction
Speech understanding
He’s as headstrong as an alleg***** in the
upwaters of the Yangtze

allegory? in upwaters? No
 alligator? in upwaters? No
 allegory+friends in upwaters? No
 alligator+friends in upwaters? Yes
PP-attachment
investigate stromatolite with microscope/speckles
– microscope: verb attachment
– speckles: noun attachment
inspect jasper with spectrometer
– which?
PP attachment (cont)

compare frequencies of
– <inspect, with, spectrometer>
– <jasper, with, spectrometer>

both zero? Try
– <inspect+friends, with, spectrometer+friends>
– <jasper+friends, with, spectrometer+friends>
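
The comparison above might look like this in Python. It is a sketch under stated assumptions: the triple counts and the `FRIENDS` entries are invented for illustration, and `evidence` and `attach` are hypothetical names.

```python
from collections import Counter

# Invented toy counts of <head, preposition, noun> triples.
TRIPLE_COUNTS = Counter({
    ("examine", "with", "microscope"): 5,
    ("investigate", "with", "microscope"): 3,
    ("rock", "with", "speckles"): 2,
})
# Invented toy thesaurus entries.
FRIENDS = {
    "inspect": ["examine", "investigate"],
    "jasper": ["rock", "stone"],
    "spectrometer": ["microscope"],
}

def evidence(head, prep, noun):
    """Direct triple count, or the count over head+friends and noun+friends if zero."""
    direct = TRIPLE_COUNTS[(head, prep, noun)]
    if direct:
        return direct
    heads = [head] + FRIENDS.get(head, [])
    nouns = [noun] + FRIENDS.get(noun, [])
    return sum(TRIPLE_COUNTS[(h, prep, n)] for h in heads for n in nouns)

def attach(verb, obj, prep, noun):
    """Choose verb or noun attachment by comparing the two evidence counts."""
    v = evidence(verb, prep, noun)
    n = evidence(obj, prep, noun)
    if v == n:
        return "undecided"
    return "verb" if v > n else "noun"
```

Here `attach("inspect", "jasper", "with", "spectrometer")` has no direct evidence either way, so the decision rests entirely on the friends.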
Conjunction scope

Compare
– old boots and shoes
– old boots and apples
Are the shoes old?
 Are the apples old?
 Hypothesis:

– wide scope only when words are similar

hard problem: thesaurus might help
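
One way to state the hypothesis as code, purely as a sketch: the `NEIGHBOURS` thesaurus is invented toy data, and treating "similar" as "one word lists the other as a neighbour" is an assumption, not a claim about how any real system decides.

```python
# Invented toy thesaurus: each word maps to a set of near neighbours.
NEIGHBOURS = {
    "boots": {"shoes", "sandals", "gloves"},
    "shoes": {"boots", "sandals", "socks"},
    "apples": {"pears", "oranges", "plums"},
}

def similar(w1, w2, neighbours=NEIGHBOURS):
    """Treat two words as similar if either lists the other as a neighbour."""
    return w2 in neighbours.get(w1, set()) or w1 in neighbours.get(w2, set())

def bracket(modifier, noun1, noun2):
    """Apply the hypothesis: wide scope only when the conjoined nouns are similar."""
    if similar(noun1, noun2):
        return f"{modifier} [{noun1} and {noun2}]"   # wide scope
    return f"[{modifier} {noun1}] and {noun2}"       # narrow scope
```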
Bridging anaphor resolution
– Maria bought a large apple. The fruit was
red and crisp.
fruit and apple co-refer
 How to find co-referring terms?

Text cohesion

words on same theme
– same segment

change in theme of words
– new segment

same theme: same thesaurus class
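
A minimal sketch of segmentation by thesaurus class follows. The `WORD_CLASS` mapping is invented, and the rule used here (open a new segment as soon as a known word's class differs from the current one) is a deliberately crude assumption for illustration.

```python
# Invented toy mapping from words to thesaurus classes.
WORD_CLASS = {
    "cod": "fish", "carp": "fish", "bream": "fish",
    "sword": "weapon", "spear": "weapon", "lance": "weapon",
}

def segment(words, word_class=WORD_CLASS):
    """Start a new segment whenever the thesaurus class of a known word changes."""
    segments, current, current_class = [], [], None
    for w in words:
        cls = word_class.get(w)  # None for words the thesaurus does not cover
        if cls is not None and current_class is not None and cls != current_class:
            segments.append(current)
            current = []
        if cls is not None:
            current_class = cls
        current.append(w)
    if current:
        segments.append(current)
    return segments
```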
Word Sense Disambiguation (WSD)

pike: fish or weapon
– We caught a pike this afternoon

probably no direct evidence for
– catch pike

probably is direct evidence for
– catch {pike,carp,bream,cod,haddock,…}
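
The pike example can be sketched as follows. The object counts and the two friend lists are invented toy data, and `disambiguate` is a hypothetical helper that simply pools the evidence for each group of friends.

```python
from collections import Counter

# Invented toy counts of <obj, verb, noun> triples, keyed by (verb, noun).
OBJ_COUNTS = Counter({
    ("catch", "carp"): 7, ("catch", "cod"): 5, ("catch", "bream"): 2,
    ("wield", "spear"): 4, ("wield", "lance"): 1,
})
# Invented lists of the friends that pike has in each of its uses.
SENSE_FRIENDS = {
    "fish": ["carp", "bream", "cod", "haddock"],
    "weapon": ["spear", "lance", "halberd"],
}

def disambiguate(verb, senses=SENSE_FRIENDS, counts=OBJ_COUNTS):
    """Score each reading by how often the verb takes that reading's friends as object."""
    scores = {s: sum(counts[(verb, n)] for n in nouns) for s, nouns in senses.items()}
    best = max(scores, key=scores.get)
    return best, scores
```

With no count at all for catch pike itself, `disambiguate("catch")` still prefers the fish reading because of the counts for carp, cod and bream.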
WordNet, Roget

widely used for all the above
The WASPS thesaurus
– credit: David Tugwell
– EPSRC grant K8931

POS-tag, lemmatise and parse the BNC
(100M words)
 Find all grammatical relations
– <obj, climb, bank>
– <modifier, big, bank>
– <subject, bank, refuse>

70 million triples
WASPS thesaurus (cont)

Similarity:
– <obj, drink, beer>
– <obj, drink, wine>

one point similarity between beer and wine
 count all points of similarity between all pairs
of words
 weight according to frequencies
– product of MI: Lin (1998)
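
The counting and weighting steps above can be sketched in Python. Everything here is an assumption made for illustration: the triples are invented, only the last slot of each triple is treated as the word being described, MI is taken to be pointwise mutual information between a word and its (relation, head) context, and only positively associated contexts are counted. It follows the slide's "product of MI" wording and should not be read as the exact WASPS or Lin (1998) formula.

```python
import math
from collections import Counter, defaultdict

# Invented toy triples in the <relation, head, word> style of the slides.
TRIPLES = [
    ("obj", "drink", "beer"), ("obj", "drink", "beer"), ("obj", "drink", "wine"),
    ("obj", "pour", "wine"), ("obj", "pour", "beer"),
    ("obj", "climb", "bank"), ("modifier", "big", "bank"), ("modifier", "red", "wine"),
]

def context_counts(triples):
    """Count how often each word occurs in each (relation, head) context."""
    counts = defaultdict(Counter)
    for rel, head, word in triples:
        counts[word][(rel, head)] += 1
    return counts

def mi_scores(counts):
    """Pointwise mutual information between each word and each of its contexts."""
    total = sum(sum(c.values()) for c in counts.values())
    word_total = {w: sum(c.values()) for w, c in counts.items()}
    ctx_total = Counter()
    for c in counts.values():
        ctx_total.update(c)
    return {
        w: {ctx: math.log2(n * total / (word_total[w] * ctx_total[ctx]))
            for ctx, n in c.items()}
        for w, c in counts.items()
    }

def similarity(mi, w1, w2):
    """Sum, over shared contexts, the product of the two words' MI scores."""
    shared = mi[w1].keys() & mi[w2].keys()
    return sum(mi[w1][c] * mi[w2][c]
               for c in shared if mi[w1][c] > 0 and mi[w2][c] > 0)

MI = mi_scores(context_counts(TRIPLES))
```

On this toy data beer and wine share the object-of-pour context, so `similarity(MI, "beer", "wine")` is positive, while beer and bank share no context at all.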
Word Sketches
one-page summary of a word’s
grammatical and collocational behaviour
 demo: http://wasps.itri.bton.ac.uk
 the Sketch Engine

– input any corpus
– generate word sketches and thesaurus
– just available now
Nearest neighbours to zebra
Nearest neighbours
zebra: giraffe buffalo hippopotamus rhinoceros
gazelle antelope cheetah hippo leopard
kangaroo crocodile deer rhino herbivore
tortoise primate hyena camel scorpion
macaque elephant mammoth alligator
carnivore squirrel tiger newt chimpanzee
monkey
exception: exemption limitation exclusion
instance modification restriction recognition
extension contrast addition refusal example
clause indication definition error restraint
reference objection consideration concession
distinction variation occurrence anomaly offence
jurisdiction implication analogy
pot: bowl pan jar container dish jug mug tin tub
tray bag saucepan bottle basket bucket vase
plate kettle teapot glass spoon soup box can
cake tea packet pipe cup
VERBS
measure
determine assess calculate decrease monitor
increase evaluate reduce detect estimate indicate
analyse exceed vary test observe define record
reflect affect obtain generate predict enhance alter
examine quantify relate adjust
boil
simmer heat cook fry bubble cool stir warm steam
sizzle bake flavour spill soak roast taste pour dry
wash chop melt freeze scald consume burn mix
ferment scorch soften
ADJECTIVES
hypnotic
haunting piercing expressionless dreamy
monotonous seductive meditative emotive
comforting expressive mournful healing indistinct
unforgettable unreadable harmonic prophetic
steely sensuous soothing malevolent irresistible
restful insidious expectant demonic incessant
inhuman spooky
pink
purple yellow red blue white pale brown green
grey coloured bright scarlet orange cream black
crimson thick soft dark striped thin golden faded
matching embroidered silver warm mauve damp
Nearest neighbours
crane: winch heron tractor truck swan
winch: crane mast rigging pump tractor
swan: heron crane gull tern curlew
heron: tern gull swan crane flamingo
no clustering (tho’ could be done)
 no hierarchy (tho’ could be done)
 rhythm
 all on the web:
http://wasps.itri.bton.ac.uk

– registration required
The web

an enormous linguist’s playground
– Computational Linguistics Special Issue,
Kilgarriff and Grefenstette (eds) 29 (3)
• (coming soon)
Google sets
http://labs.google.com/sets
 Input: zebra giraffe buffalo
 Output: kudu hyena impala leopard hippo
waterbuck elephant cheetah eland

Google sets
http://labs.google.com/sets
 Input: harbin beijing nanking
 Output: shanghai chengdu guangzhou
hangzhou changchun zhejiang kunming
dalian jinan fuzhou

Tree structure

Roget
– all human knowledge
as tree structure
– 1000 top categories
• subdivisions
– like this
» etc
» etc
Directories and thesauruses
Yahoo, http://www.yahoo.com
 Open directory project, http://dmoz.org

– all human activity as tree structure
plus corpus at every node
– gather corpus, identify domain vocabulary
• Gonzalo and colleagues, Madrid, CL Special
Issue
• Agirre and colleagues, ‘topic signatures’
Words and word senses

automatic thesauruses
– words

manual thesauruses
– simple hierarchy is appealing
– homonyms
– “aha! objects must be word senses”
Problems
Theoretical
 Practical

Theoretical
Wittgenstein
Don’t ask for the
meaning,
ask for the use
Problems

Practical
– a thesaurus is a tool
– if the tool organises word senses you must do
WSD before you can use it
– WSD: state of the art, optimal conditions: 80%
“To use this tool, first replace one fifth of your
input with junk”
Avoid word senses
This word has three meanings/senses
 This word has three kinds of use

– well founded
– empirical
– we can study it
sorry, Roget
sorry, AI

AI model for NLP:
– NLP turns text into meanings
– AI reasons over meanings
– word meanings are concepts in an ontology
– a Roget-like thesaurus is (to a good approximation) an ontology
– Guarino: “cleansing” WordNet

If a thesaurus groups words in their various
uses (not meanings)
– not the sort of thing AI can reason over
sorry, AI

“linguistic expressions prompt for
meanings rather than express
meanings”
– Fauconnier and Turner 2003
It would be nice if …
 But …

Evaluation

manual thesauruses
– not done

automatic thesauruses: attempts
– pseudo-disambiguation (Lee 1999)
– with ref to manual ones (Lin 1998)
Task-based evaluation

Parsing
– PP-attachment
– conjunction scope
Bridging anaphors
Text cohesion
Word sense disambiguation (WSD)
Speech understanding
Spelling correction

What is performance at the task
– with no thesaurus
– with Roget
– with WordNet
– with WASPS
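
A task-based comparison of this kind could be organised roughly as below. This is a sketch only: the task, the gold data and both thesauruses are invented toys standing in for a real task and for resources such as Roget, WordNet or WASPS, and the numbers it produces say nothing about any of them.

```python
def accuracy(predict, gold):
    """Fraction of gold items for which predict(item) equals the gold answer."""
    return sum(predict(item) == answer for item, answer in gold) / len(gold)

def compare(task, thesauruses, gold):
    """Run the same task once per thesaurus and report the accuracy of each run."""
    return {
        name: accuracy(lambda item, t=thesaurus: task(item, t), gold)
        for name, thesaurus in thesauruses.items()
    }

# Toy task: are two conjoined nouns similar enough for wide modifier scope?
def wide_scope(pair, thesaurus):
    first, second = pair
    return second in thesaurus.get(first, set())

# Invented gold standard and invented thesauruses.
GOLD = [(("boots", "shoes"), True), (("boots", "apples"), False)]
THESAURUSES = {
    "no thesaurus": {},
    "toy thesaurus": {"boots": {"shoes", "sandals"}},
}
RESULTS = compare(wide_scope, THESAURUSES, GOLD)
```

The point of the design is that the task and the gold data stay fixed while only the thesaurus varies, so any difference in `RESULTS` can be attributed to the thesaurus.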
Plans
set up evaluation tasks
 theseval
 web-based thesaurus

– Open Directory Project hierarchies

campaign
Cyborgs
Robots: will they take over?
 Rod Brooks’s answer:

– Wrong question: greatest advances are in
what the human+computer ensemble can
do
Cyborgs

A creature that is partly human and
partly machine
– Macmillan English Dictionary
Cyborgs and the Information Society
The thesaurus-making agent is
part human (for precision), part
computer (for recall).
Summary: Thesauruses for NLP
Definition
 Uses for NLP
 WASPS thesaurus
 web thesauruses
 Argument: words not word senses
 Evaluation proposals
 Cyborgs

Thesaurus-makers of the future?