RECOGNISING NOMINALISATIONS

• Supervisors: Dr. Alex Lascarides, Dr. Mirella Lapata
• (Andrew) Yuk On KONG
• University of Edinburgh
DEFINITION
• “Nominalisation refers to the process of forming a
noun from some other word-class. (e.g. red+ness)
• or (in classical transformational grammar
especially) the derivation of a noun phrase from an
underlying clause (e.g. Her answering of the
letter….from She answered the letter).
• The term is also used in the classification of
relative clauses (e.g. What concerns me is her
attitude)…….” (Crystal 1997)
• Only nominalisations in the first sense, and only those derived from
verbs, are considered here, e.g. "statement" from "state".
• Problem: given a word, is it a noun, and is it derived from a verb or not?
• Nominalisations derived from verbs are very
productive in English and are usually created by
suffixation (i.e. noun-forming suffixes
are attached to verb bases).
EXCLUSIONS
• Nominals, e.g. the poor, the wounded
• Nominalisations NOT from verbs, e.g. redness
• -ing form, e.g. the making of the movie
• Antidisestablish-ment-arian-ism
REGULAR?
• Nominalise → nominalisation
• Interpret → interpretation
• Interrupt → interruption
• Associate → association
• delete → deletion
• break → breakage
• leak → leakage
• Confine → confinement
• Refine → refinement
(but define → definition)
• submit → submission
• admit → admission (but also admittance)
• remit → remission; remittance; remit
VERB=NOUN
• Debate → debate (not debation); debater
• Pay → pay
• Love → love
• Boss → boss
• Stand → stand
• Purchase → purchase
• Lie (cf. lie down) → lie ("tell a lie")
VERB=NOUN (except stress)
• transfer → transfer
• transport → transport
• import → import
• rebel → rebel; (rebellion)
1 VERB, >1 NOUNS
• Collect → collection; collector
• Interpret → interpretation; interpreter
• Cover → cover; coverage
• Conduct → conduction; conductor
• Depend → dependant/dependent; dependence; dependency
SEMANTICS
• Conduct (conduct electricity/heat) → conduction
• Conduct (behave/organise) → conduct
WHEN TO USE WHICH SUFFIX
• -tion/-sion
• -er/-or
• Debate → debater
• Talk → talker
• Collect → collector
• Conduct → conductor
IRREGULAR NOMINALISATION
• Choose → choice
• Succeed → success; succession; successor
• Decide → decision
• Sell → sale
PSEUDO-NOMINALISATION
• Mote?? (noun; a very small piece of dust) → motion
• Depart → departure; department???
• Apart → apartment????
WHY BOTHER?
• The identification of nominalisations and
their associated verbs (e.g. "statement" and
"state") is important for a number of NLP
tasks:
– machine translation
– information retrieval
– automatic learning of machine-readable dictionaries
– grammar induction
HOW?
• nominalisation is a productive
morphological phenomenon:
• list all acceptable nominalised forms?
• New words?
TECHNIQUES NOT FOCUSING ON NOMINALISATIONS
• build rules
• machine-learning approaches to induce
morphological structures using large
corpora
• knowledge-free induction of inflectional
morphologies (Schone and Jurafsky 2001).
SCHONE AND JURAFSKY (2001)
• Schone and Jurafsky (2001) worked on acquiring cognates and
morphological variants, drawing on:
– Induced semantics: Latent Semantic Analysis (LSA)
– Induced orthographic info
– Induced syntactic info
– Transitive information
– Affix frequencies
GOAL OF THIS STUDY
• The principal goal of this project is to
develop a system which can recognise
nominalisations, together with the verbs
from which they are derived.
EXPERIMENT 1 (baseline)
• identify nouns using the tags in the corpus
• identify potential nominalisations from the list of
nouns with a list of nominalisation suffixes
• find the corresponding potential verb for each by
identifying the verb (from among verbs as tagged)
that shares with it the greatest number of letters in
sequence
• accept a nominalisation-verb pair if more than 50% of the
letters match, and discard all other pairs
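The baseline steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: the suffix list is hypothetical, "letters in sequence" is read as the longest shared contiguous run of letters, and the match percentage is taken relative to the length of the noun (the slides do not say which length is the denominator). Nouns and verbs are assumed to be lower-cased word lists already extracted from the tagged corpus.

```python
from difflib import SequenceMatcher

# Hypothetical suffix list; the project's actual list is not given in the slides.
SUFFIXES = ("tion", "sion", "ment", "age", "ance", "ence", "er", "or")


def longest_shared_run(a, b):
    """Length of the longest contiguous run of letters shared by a and b."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return matcher.find_longest_match(0, len(a), 0, len(b)).size


def pair_nominalisations(nouns, verbs, threshold=0.5):
    """Map each candidate nominalisation to its best-matching verb."""
    pairs = {}
    for noun in nouns:
        # Step 2: keep only nouns ending in a nominalisation suffix.
        if not noun.endswith(SUFFIXES):
            continue
        # Step 3: pick the verb sharing the longest run of letters
        # (the first such verb wins ties).
        best_verb, best_run = None, 0
        for verb in verbs:
            run = longest_shared_run(noun, verb)
            if run > best_run:
                best_verb, best_run = verb, run
        # Step 4: accept only if more than 50% of the noun's letters match
        # (assumed denominator: the noun's length).
        if best_verb is not None and best_run / len(noun) > threshold:
            pairs[noun] = best_verb
    return pairs


nouns = ["statement", "collection", "table", "department"]
verbs = ["state", "collect", "depart", "eat"]
print(pair_nominalisations(nouns, verbs))
# {'statement': 'state', 'collection': 'collect', 'department': 'depart'}
```

Under these assumptions the sketch also pairs "department" with "depart", which is the kind of pseudo-nominalisation a purely letter-based baseline cannot rule out.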
EXPERIMENT 2
• use a decision tree to build a model
• possible features include:
-letter similarity between verbs and nouns
-suffix frequency
-verb frequency
-verb semantics
-subject of noun
-subject of verb
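As a rough illustration of how the first three features might be computed for one candidate pair, here is a minimal sketch. The suffix list, the function names and the exact feature definitions are assumptions made for illustration only; the semantic and subject features are left out because they would need a parsed corpus, and the decision-tree learner itself is not shown.

```python
from collections import Counter
from difflib import SequenceMatcher

# Hypothetical suffix list, for illustration only.
SUFFIXES = ("tion", "sion", "ment", "ance", "age", "er", "or")


def suffix_of(noun):
    """Return the longest listed suffix that the noun ends with, or None."""
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if noun.endswith(suffix):
            return suffix
    return None


def pair_features(noun, verb, noun_counts, verb_counts):
    """Build a feature dictionary for one (noun, verb) candidate pair."""
    matcher = SequenceMatcher(None, noun, verb, autojunk=False)
    run = matcher.find_longest_match(0, len(noun), 0, len(verb)).size
    suffix = suffix_of(noun)
    # Suffix frequency: total corpus count of nouns ending in the same suffix.
    suffix_frequency = sum(
        count for word, count in noun_counts.items()
        if suffix is not None and word.endswith(suffix)
    )
    return {
        "letter_similarity": run / len(noun),  # assumed normalisation
        "suffix_frequency": suffix_frequency,
        "verb_frequency": verb_counts[verb],
    }


noun_counts = Counter({"statement": 3, "movement": 2, "collection": 4})
verb_counts = Counter({"state": 5, "move": 7})
print(pair_features("statement", "state", noun_counts, verb_counts))
```

Each dictionary would then become one training instance, labelled as a true or false nominalisation-verb pair.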
EVALUATION
• Experiments will be based on the BNC.
• The obtained nominalisations will be evaluated against the CELEX
morphological lexicon and manually annotated data.
• Measures: precision, recall and F-score
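The three measures can be computed over sets of (nominalisation, verb) pairs as in the sketch below. It assumes the balanced F-score (F1) and that the gold standard is simply a set of pairs; the example pairs are made up for illustration and are not results.

```python
def precision_recall_f(predicted, gold):
    """Precision, recall and balanced F-score over sets of pairs."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Illustrative pairs only, not project results.
predicted = {("statement", "state"), ("collection", "collect"),
             ("department", "depart")}
gold = {("statement", "state"), ("collection", "collect"),
        ("decision", "decide"), ("sale", "sell")}
p, r, f = precision_recall_f(predicted, gold)
print(f"P={p:.2f} R={r:.2f} F={f:.2f}")
```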
BRITISH NATIONAL CORPUS
• Over 100 million words
• Corpus of modern English
• Both spoken (10%) and written (90%)
• Each word is automatically tagged by the CLAWS
stochastic POS tagger
• 65 different tags
• Encoded using SGML to represent POS tags and a
variety of other structural properties of texts (e.g.
headings, paragraphs, lists, etc.)
<item>
<s n=086>
<w NN1-VVG>Shopping <w PRP>including <w NN1>collection <w PRF>of
<w NN2>prescriptions
</item>
<item>
<s n=087>
<w VVG>Daysitting <w CJC>and <w VVG>nightsitting
</item>
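Markup like the sample above can be turned into noun and verb lists with a short script. The sketch below is an assumption-laden illustration: it relies only on the `<w TAG>word` pattern visible in the sample, treats tags beginning with NN as nouns and VV as lexical verbs, and for ambiguity tags such as NN1-VVG uses the first tag (assumed here to be the tagger's preferred reading).

```python
import re

# Matches the "<w TAG>word" pattern seen in the sample; whitespace
# (including a line break) may separate "<w" from the tag.
W_PATTERN = re.compile(r"<w\s+([A-Z0-9-]+)>([^<]+)")


def tagged_words(sgml):
    """Return (tag, word) tuples for every <w ...> element in the text."""
    return [(tag, word.strip()) for tag, word in W_PATTERN.findall(sgml)]


def nouns_and_verbs(sgml):
    """Split tagged words into lower-cased noun and verb lists."""
    nouns, verbs = [], []
    for tag, word in tagged_words(sgml):
        first_tag = tag.split("-")[0]  # assumed preferred reading
        if first_tag.startswith("NN"):
            nouns.append(word.lower())
        elif first_tag.startswith("VV"):
            verbs.append(word.lower())
    return nouns, verbs


sample = """<item>
<s n=086>
<w NN1-VVG>Shopping <w PRP>including <w
NN1>collection <w PRF>of
<w NN2>prescriptions
</item>
<item>
<s n=087>
<w VVG>Daysitting <w CJC>and <w
VVG>nightsitting
</item>"""
print(nouns_and_verbs(sample))
# (['shopping', 'collection', 'prescriptions'], ['daysitting', 'nightsitting'])
```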
CELEX
• English, Dutch and German
• Annotated by humans, using lemmata from
two dictionaries of English
• 52,446 lemmata and 160,594 wordforms
• orthographic, phonological, morphological,
syntactic and frequency information
• morphological structure, e.g.
((celebrate),(ion))
MILESTONES
• 6/2002 Experiment 1—baseline
• 7/2002 Experiment 2
• 8/2002 Write-up
• 9/2002 Finalise report