
Corpus Linguistics
meets
Lexical Semantic Theory
James Pustejovsky
Brandeis University
University of Pavia
December 15, 2004
Background
• Joint work with Patrick Hanks and Anna Rumshisky
• Research funded by NSF
• References:
  – Pustejovsky, J., P. Hanks, and A. Rumshisky (2004) “Automated Induction of Sense in Context”, Proceedings of COLING, Geneva.
  – Pustejovsky, J. and P. Hanks (2001) “Very Large Lexical Databases”, Tutorial Notes from ACL, Toulouse.
Outline
• Corpus Linguistics needs Theory
  – Linguistic Theory needs corpus data
• Assumptions about Lexicons
  – Lexicons are for something
• Possible versus Probable Meaning
• Encoding Context for a predicate
  – Capturing word senses through context
• Remarks on Lexical Architectures
• Semantic Induction from Corpora
  – Theory guides clustering
Building Lexicons with Corpus Analysis
• Regarding Lexicons:
– Lexicons are for some purpose or task.
– There is no one lexicon but multiple lexicons.
• Regarding Senses (GL, 1995):
– Words have senses, but there is no finite number of senses independent of the contextualized use of words in composition.
• Words have meaning potentials:
– Words are active objects with functional behavior.
What is the relationship between
corpus and lexicon?
• Corpus:
– an accumulation of tokens
• Lexicon:
– an ordered collection of word-types (lemmas), with
data attached.
As corpora grow:
• There is a continuing flow of new nouns.
• There are very few new verbs and adjectives:
  – but an increasing number of contexts for them.
• No new function words.
Content of Lexicons for real Applications
• Proper Names:
– humans, locations, institutions, brands, products
• Open class items:
– nouns, verbs, modifiers
• Multiword Expressions:
  – compounds, idioms, collocations, constructions
Things to hang on Words:
• Inflectional forms of the lemma
• Phonetic form
• Syntactic categorization
• Subcategorization
• Semantic Type
• Typical Contexts (phraseology)
• Co-specifications
• Implicatures (contextually determined)
• Translations
• Examples of usage
• Probabilities
Lexicon Design should:
• Enable the Possible.
• Be tempered by the Probable.
• Be embedded within a specific application:
  – instances of the actual.
Selectional Features from Corpora
Selection doesn’t specify exactly how a word is going to behave on all occasions. Rather:
• Selection specifies how words typically behave.
• The typical is the foundation for forecasting the probable.
• The probable comes from the corpus and the cospecifications associated with words.
Lexical Acquisition
Goals:
- Acquisition of subcategorization using corpus
analytics;
- Learning selectional associations;
- Clustering of complementation patterns.
All are necessary techniques, but:
- There must be an initial lexical architecture;
- Efficacy of the results depends on the application model and the corpus available.
Encoding Context
What is the Context of a Linguistic Utterance?
• Local context is characterized as Strong Selection.
• Broad context is captured in part by Weak Selection.
• Words encode context as types.
• Compositional rules refer to these types:
  – Types can be selected;
  – Types can be coerced;
  – Types can be exploited.
• Composition can license new interpretations.
Basic Generative Lexicon
• Two classes of sortal constraints on a concept:
– Argument structure
– Event structure
• These bind into the Qualia Structure
• Compositional Rules invoke Type Selection
• Type Coercion: Inviolable Selection
• Type Exploitation: Subselection of type
features
Qualia Structure
• Formal: the basic category that distinguishes an object within a larger domain;
• Constitutive: the relation between an object and its constituent parts;
• Telic: its purpose and function;
• Agentive: factors involved in its origin or “bringing it about”.
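To make type coercion via qualia concrete, here is a toy Python sketch; the QUALIA table and the coercion function are invented for illustration and are not part of the GL or CPA tooling:

# Toy illustration of GL type coercion: a predicate that selects an Event
# can coerce an Artifact argument by exploiting its qualia (here the telic).
QUALIA = {"book": {"formal": "Artifact", "telic": "read", "agentive": "write"}}

def coerce_to_event(noun):
    q = QUALIA[noun]
    if q["formal"] == "Event":
        return noun                   # direct type selection succeeds
    return f"{q['telic']} {noun}"     # coercion via the telic quale

print(coerce_to_event("book"))  # 'read book', as in "enjoy (reading) the book"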
Types and Words Select Different Things
Types:
- Operation: Selection Restrictions (semantic typing)
- Result: Possible combinations
Tokens:
- Operation: Corpus Selection (cospecification)
- Result: Probable combinations
Sense of a word depends on its context
• Consider the word treat:
Peter treated Mary badly.
Peter treated Mary with antibiotics.
Peter treated Mary with respect.
Peter treated Mary for her asthma.
Peter treated Mary to a fancy dinner.
Peter treated Mary to his views on George W. Bush.
Peter treated the woodwork with creosote.
• Dictionaries do not provide the contexts that
distinguish one sense of a word from another.
Problem: what context is relevant?
• The more senses a word has, the greater its lexical
entropy.
• How to decide what context features determine the
sense of a word?
• We want a data-driven sense definition.
– Sort contexts of use for a given word into
“buckets” to reduce lexical entropy
– Analyze features typical for each “bucket”
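As an invented numerical illustration of how sorting contexts into buckets reduces lexical entropy (the sense counts and the splitting feature here are made up):

import math
from collections import Counter

def entropy(counts):
    """Shannon entropy (bits) of a sense distribution."""
    total = sum(counts.values())
    return -sum(n / total * math.log2(n / total) for n in counts.values() if n)

# Hypothetical sense counts for "treat" in a 200-line concordance.
all_contexts = Counter(behave=120, medical=50, apply_substance=20, entertain=10)

# The same lines sorted into "buckets" by one context feature:
# does the clause contain a "for [[Ailment]]" prepositional phrase?
with_for_pp = Counter(medical=45, behave=5)
without_for_pp = Counter(behave=115, medical=5, apply_substance=20, entertain=10)

n = sum(all_contexts.values())
h_before = entropy(all_contexts)
h_after = (sum(with_for_pp.values()) / n * entropy(with_for_pp)
           + sum(without_for_pp.values()) / n * entropy(without_for_pp))
print(f"H(sense) = {h_before:.2f} bits; H(sense | for-PP) = {h_after:.2f} bits")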
Corpus Pattern Analysis (CPA)
Corpus Pattern Analysis (CPA) is a corpus-analytic and automated induction technique that:
1. Identifies the typical syntagmatic patterns for each
word and determines discriminant context features.
2. Catalogs semantic types of arguments that are
relevant for distinguishing between different
senses.
3. Creates an inventory of syntactic and lexical
realizations for relevant semantic types.
CPA (II)
• Word senses are linked to syntagmatic patterns.
• Selection contexts of a word are the typical
syntagmatic patterns of its use.
• Selection contexts can be indexed on clauses and
phrases, as well as single words.
• Selection contexts are captured in CPA patterns.
Current work focuses on CPA patterns for verbs.
Research Areas Impacted by CPA
• Selectional preference acquisition
– Resnik (1996), Briscoe & Carroll (1997),
Abney & Light (1999), Korhonen (2002)
• Word sense disambiguation
– SENSEVAL efforts, Stevenson & Wilks (2001),
Agirre et al. (2002)
• Ontology construction
– EuroWordNet, SIMPLE
CPA Components
• Lexical discovery
– Manual discovery of selection context patterns for
specific verbs through corpus analysis
• Automatic recognition of pattern use
– Sorting unseen instances of verb use according to
nearest match to identified patterns
– Similar to conventional WSD
• Automatic pattern acquisition
– Acquisition of patterns for unanalyzed cases
• Discriminant feature selection
• Predicate-based argument clustering
CPA Pattern Elements
• Syntactic Parsing
– Phrase-level parsing (clause roles)
• Shallow Semantic Typing
– Generic semantic features
– Brandeis Shallow Ontology
• Minor Category Parsing
– Adverbial Phrases, Locatives, Purpose Clauses,
Rationale Clauses, Temporal Adjuncts, etc.
• Subphrasal Syntactic Cue Recognition
– Genitives, partitives, bare plural/determiner distinction,
infinitivals, negatives, past participles, etc.
CPA Pattern Grammar
• Pattern grammar (fragment)
CPA-Pattern -> Segment verb-lit Segment | CPA-Pattern ';' Rstr
Segment -> Element | Segment Segment | '(' Segment ')' | Segment '|' Segment
Element -> literal | '[' Rstr ArgType ']' | '[' Rstr literal ']' | '[' Rstr ']' | '[' NO Cue ']'
Rstr -> POS | Phrasal | Rstr '|' Rstr | ε
Cue -> POS | Phrasal | AdvCue
• For example,
[[Person]] assemble [[Artifact]]
[PLURAL[Person]] | [[Human Group]] assemble (in [[Location]])
[[Person 1]] treat [[Person 2]] ; NO Adv[Manner]
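A minimal sketch of how patterns of this shape could be matched against a shallow-typed clause; the dictionary encoding and cue names are assumptions for illustration, not the project's implementation:

# Hypothetical encoding of two "treat" patterns; 'require'/'forbid' model
# positive cues and the '; NO ...' restrictor from the grammar above.
PATTERNS = {
    "[[Person 1]] treat [[Person 2]] Adv[Manner]":
        {"subj": "Person", "obj": "Person",
         "require": {"adv_manner"}, "forbid": set()},
    "[[Person 1]] treat [[Person 2]] ; NO Adv[Manner]":
        {"subj": "Person", "obj": "Person",
         "require": set(), "forbid": {"adv_manner"}},
}

def matches(p, clause):
    """clause: shallow-typed arguments plus a set of minor-category cues."""
    return (clause["subj"] == p["subj"] and clause["obj"] == p["obj"]
            and p["require"] <= clause["cues"]
            and not (p["forbid"] & clause["cues"]))

# "Peter treated Mary with antibiotics."
clause = {"subj": "Person", "obj": "Person", "cues": {"iobj_with"}}
print([name for name, p in PATTERNS.items() if matches(p, clause)])
# -> ['[[Person 1]] treat [[Person 2]] ; NO Adv[Manner]']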
Corpus-Driven Type System
• Shallow Typing
– applying a shallow-type ontology to a parsed corpus
• Type Promotion
  – promoting to type position lexical units that cross a particular statistical threshold
• Lexical Sets
– predicate-based groupings of similarly typed lexical
elements from corpus
E.g. absorb: heat, light, energy, power, shock, wave,
sound, impact, movement
– populated through type-filtered cluster analysis, in each
argument position
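A toy sketch of type promotion and lexical-set population under invented counts; a real system would apply a statistical test rather than a raw frequency cutoff:

from collections import Counter

# Hypothetical direct-object lemmas of "absorb" harvested from a parsed corpus.
objects_of_absorb = ["heat", "light", "energy", "shock", "heat", "sound",
                     "impact", "energy", "criticism", "heat", "light", "cost"]

counts = Counter(objects_of_absorb)
threshold = 2  # promotion threshold; stands in for a proper statistical test

# Lexical set for the object slot of "absorb": lemmas above threshold.
lexical_set = {lemma for lemma, n in counts.items() if n >= threshold}
print(lexical_set)  # {'heat', 'light', 'energy'}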
Fine-tuning the Features
• Extending classification of minor
categories
e.g. adverbials of manner/effect
– Peter treated Mary rudely.
– Peter treated Mary effectively.
• Semantic features defining lexical sets
e.g. Energy (argtype for absorb)
– heat, light, energy, power, shock, wave, sound,
impact, movement
Implementation Details
• CPA patterns for an initial sampling of verbs are derived manually.
• A corpus is parsed (British National Corpus).
• A shallow type system is applied to the parsed
corpus (Brandeis Shallow Ontology).
• A training sample is created.
• Machine learning techniques are applied to
disambiguate the unseen instances using pattern
features.
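A minimal sketch of the final step, using scikit-learn as a stand-in for whatever learners were actually used; the feature names follow the “Selected context features” slide below, and the training data is invented:

from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

# Binary context-feature vectors for instances of "treat" (toy data).
features = ["obj_person", "iobj_with", "mod_adv_ly", "clausal_like"]
X_train = [[1, 1, 0, 0], [1, 0, 1, 0], [1, 0, 1, 0], [1, 1, 0, 0]]
y_train = ["medical", "behave", "behave", "medical"]

for clf in (DecisionTreeClassifier(), KNeighborsClassifier(n_neighbors=3)):
    clf.fit(X_train, y_train)
    print(clf.predict([[1, 1, 0, 0]]))  # unseen instance -> ['medical']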
Brandeis Shallow Ontology
• BSO Noun Coverage
– 3400 type nodes total
– 20,000 noun entries
– 10,000 nominal collocation entries
• 65 Shallow Types
– ‘Abstract’, ‘Asset’, ‘Animate’, ‘Artifact’, ‘Document’,
‘Human Group’, ‘Information’, ‘Institution’,
‘Location’, ‘Person’, ‘PhysObj’, ‘Process’, ‘Substance’,
‘Surface’, ‘Time Period’, etc.
• A subset of 24 shallow types was used in the experiments.
RASP Statistical Parsing System
(Briscoe & Carroll, 2002)
• input tokenized, POS-tagged, lemmatized
• generates forest of full parse trees for each
sentence
• set of grammatical relations associated with
each parse analysis
– named relation, head, dependent
subjects: ncsubj, clausal (csubj, xsubj)
objects: dobj, iobj, clausal complement
modifiers: adverbs, modifiers of event nominals
• the top-ranked tree is picked for sentences where a full parse succeeded
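A hypothetical illustration of the named-relation/head/dependent triples the pipeline consumes; RASP's actual output format differs in detail:

from typing import NamedTuple

class GR(NamedTuple):
    relation: str   # e.g. 'ncsubj', 'dobj', 'iobj', 'ncmod'
    head: str       # lemma of the head
    dependent: str  # lemma of the dependent

# "The hospital treated her with antibiotics."
grs = [
    GR("ncsubj", "treat", "hospital"),
    GR("dobj", "treat", "she"),
    GR("iobj", "treat", "antibiotic"),
]
direct_object = next(g.dependent for g in grs if g.relation == "dobj")
print(direct_object)  # 'she'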
Selected context features
from RASP/BSO implementation
• obj_institution: object belongs to the BSO type
‘Institution’
• subj_human_group: subject belongs to the BSO type
‘HumanGroup’
• mod_adv_ly: target verb has an adverbial modifier, with a
-ly adverb
• clausal_like: target verb has a clausal argument introduced
by ‘like’
• iobj_with: target verb has an indirect object introduced by
‘with’
• obj_PRP: direct object is a personal pronoun
• stem_VVG: the target verb stem is an -ing form
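A sketch of deriving such binary features from grammatical relations plus a shallow-type lookup; the BSO table, pronoun list, and relation labels here are toy stand-ins:

# Toy shallow-type lookup (a tiny stand-in for the Brandeis Shallow Ontology).
BSO = {"hospital": "Institution", "committee": "HumanGroup", "she": "Person"}
PRONOUNS = {"she", "he", "they", "it"}

def context_features(grs, target="treat"):
    feats = set()
    for rel, head, dep in grs:
        if head != target:
            continue
        if rel == "ncsubj" and BSO.get(dep) == "Institution":
            feats.add("subj_institution")
        if rel == "ncsubj" and BSO.get(dep) == "HumanGroup":
            feats.add("subj_human_group")
        if rel == "dobj" and dep in PRONOUNS:
            feats.add("obj_PRP")
    return feats

# "The hospital treated her ..."
print(sorted(context_features([("ncsubj", "treat", "hospital"),
                               ("dobj", "treat", "she")])))
# -> ['obj_PRP', 'subj_institution']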
Disambiguation accuracy for sample predicates

verb      patterns   training set   decision tree   kNN
edit      2          100            87%             86%
treat     4          200            45%             52%
submit    4          100            59%             64%
Experimental Results
• CPA appears to be as accurate as, or better than, other techniques for WSD.
• Different types of ambiguities are resolved with different degrees of effectiveness.
• It will be tested on the latest SENSEVAL data.
Goals of CPA
• To create an inventory of semantically motivated
syntagmatic patterns, so as to reduce the ‘lexical entropy’
of each word.
• To develop procedures for populating lexical sets by
computational cluster analysis of text corpora.
• To collect evidence for the principles that govern the
exploitations of norms.
Lexical Discovery (creating patterns)
• Create a sample concordance for each word
– 300-500 examples
– from a ‘balanced’ corpus (i.e. general language)
[We use the British National Corpus, 100M words, and the
Associated Press Newswire for 1991-3, 150M words]
• Identify statistically significant collocates
• Classify every line in the sample, on the basis of its context.
• Take further samples if necessary to establish that a particular
phraseology is conventional
• Check results against corpus-based dictionaries.
• Use introspection to interpret data, but not to create data.
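One common way to flag statistically significant collocates is pointwise mutual information; the sketch below uses invented counts and is not necessarily the measure used in CPA:

import math

# PMI for a (verb, object) candidate collocate, from hypothetical counts.
N = 100_000_000                               # corpus size in tokens (BNC-scale)
f_verb, f_noun, f_pair = 12_000, 8_000, 450   # invented frequencies

pmi = math.log2((f_pair / N) / ((f_verb / N) * (f_noun / N)))
print(f"PMI = {pmi:.1f}")  # strongly positive -> likely a real collocation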
Every line in the sample must be classified
The classes are:
• Norms (normal uses in normal contexts)
• Exploitations (e.g. coercions and ad-hoc metaphors)
• Alternations
– e.g. [[Doctor]] treat [[Patient]] <> [[Medicine]] treat [[Ailment]]
• Names (Sea Biscuit: name of a horse, not a cracker)
• Mentions (to mention a word or phrase is not to use it)
• Errors
• Unassignables
Lexical sets are contrastive sets
• Different lexical sets generate different meanings.
• The lexical sets associated with each sense of each verb are
different.
– It remains to be discovered whether they are ‘transferable’.
• In principle, lexical sets are open-ended.
• In practice, a lexical set may have only 1 or 2 members,
e.g. take a {look | glance}.
• No certainties in word meaning; only probabilities.
• … but probabilities can be measured.
A Simple CPA Entry
toast, verb
1. [[Person]] toast [[Food = bread, nuts, cheese]]
   Implicature: cook or brown [[Food]] by exposure to radiant heat.
2. [[Person 1]] toast {[[Person 2]] | success | memory}
   Implicature: honour [[Person 2]] by raising a glass of wine, then drinking some.
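One possible machine-readable rendering of such an entry (the field names are assumptions, not the project's actual format):

# Hypothetical data structure for the "toast" entry above.
toast_entry = {
    "lemma": "toast",
    "pos": "verb",
    "patterns": [
        {"subj": "Person",
         "obj": {"type": "Food", "lexset": ["bread", "nuts", "cheese"]},
         "implicature": "cook or brown [[Food]] by exposure to radiant heat"},
        {"subj": "Person 1",
         "obj": {"alternatives": ["Person 2", "success", "memory"]},
         "implicature": "honour [[Person 2]] by raising a glass of wine, "
                        "then drinking some"},
    ],
}
print(len(toast_entry["patterns"]))  # 2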
A more complicated verb: ‘take’
• 61 phrasal verb patterns, e.g.
[[Person]] take [[Garment]] off
[[Plane]] take off
[[Human Group]] take [[Business]] over
• 105 light verb uses (with specific objects), e.g.
[[Event]] take place
[[Person]] take {photograph | photo | snaps | picture}
[[Person]] take {the plunge}
• 18 ‘heavy verb’ uses, e.g.
[[Person]] take [[PhysObj]] [Adv[Direction]]
• 13 adverbial patterns, e.g.
[[Person]] take [[TopType]] seriously
[[Human Group]] take [[Child]] {into care}
• TOTAL: 197, and growing (but slowly)
Noun norms
• Norms for nouns are different in kind from norms for
verbs.
• Adjectives and prepositions are more like verbs than
nouns.
• A different analytical apparatus is required for nouns.
• Prototype statements for each true noun can be derived
from a corpus.
What are the components of a normal
context? – (2) Nouns
The apparatus for CPA (corpus pattern analysis) of nouns:
• Collocations.
Arranging collocates: storm (1)
WHAT DO STORMS DO?
• Storms blow.
• Storms rage.
• Storms lash coastlines.
• Storms batter ships and places.
• Storms hit ships and places.
• Storms ravage coastlines and other places.
Arranging collocates: storm (2)
BEGINNING OF A STORM:
• Before it begins, a storm is brewing, gathering, or impending.
• There is often a calm or a lull before a storm.
• Storms last for a certain period of time.
• Storms break.
END OF A STORM:
• Storms abate.
• Storms subside.
• Storms pass.
Arranging collocates: storm (3)
WHAT HAPPENS TO PEOPLE IN A STORM?
• People can weather, survive, or ride (out) a storm.
• Ships and people may get caught in a storm.
Arranging collocates: storm (4)
WHAT KINDS OF STORMS ARE THERE?
• There are thunder storms, electrical storms, rain
storms, hail storms, snow storms, winter storms,
dust storms, sand storms, tropical storms…
• Storms are violent, severe, raging, howling,
terrible, disastrous, fearful, ferocious…
Arranging collocates: storm (5)
TYPICAL QUALITIES OF STORMS:
• Storms, especially snow storms, may be heavy.
• An unexpected storm is a freak storm.
• The centre of a storm is called the eye of the storm.
• A major storm is remembered as the great storm (of [[Year]]).
____
• STORMS ARE ALSO ASSOCIATED WITH rain, wind, hurricanes, gales, and floods.
Why norms are important
These statements about abate and storm represent
typical usage as well as typical meaning.
• They are empirically well founded (corpus-derived).
• This is where syntax meets semantics.
How is CPA different from FrameNet?
CPA:
• investigates syntagmatic criteria for distinguishing different meanings
of polysemous words, in a “semantically shallow” way.
FrameNet:
• expresses the deep semantics of situations (frames);
• proceeds frame by frame, not word by word;
• analyses situations in terms of frame elements;
• studies meaning differences and similarities between different words in
a frame;
• does not explicitly study meaning differences of polysemous words;
• does not analyze corpus data systematically, but goes fishing in
corpora for examples in support of hypotheses;
• has problems grouping words into frames, and misses some;
• has no established inventory of frames;
• has no criteria for completeness of a lexical entry.
Challenges
• Extending the empirical discovery of lexical sets and other pattern features
• Learning to recognize all the features required by CPA patterns
Conclusions
• Creation of a selection context dictionary
• Development of a corpus-driven type system
• Identification of meaning by a richer set of criteria
• Basis for investigating the mechanisms of coercion and exploitation