Patrick_Hanks_EFNIL 2012

Modern Lexicography –
Developments, Prospects, and
Problems
Patrick Hanks
Research Institute of Information and Language
Processing
University of Wolverhampton
EFNIL, Budapest, 25 October, 2012
Outline of the talk
• Technology and Lexicography
– During the Renaissance
– Now
• Philosophy, linguistics, and lexicography
– During the Enlightenment
– Now
• Lexicography of the future
– The corpus revolution
– Presenting the facts to the public
– Can (should) natural language be regulated?
2
PART 1: Lexicography and technology
Lexicography as we know it today is possible because of two
technological developments during the Renaissance:
• The invention of printing (Gutenberg, Mainz, c. 1440)
– Enabling many copies of a work to be disseminated rapidly, regardless of its size, bulk,
and complexity.
• The invention of modern typography by Nicolas Jenson (Venice, 1470)
And the scholarship of Aldus Manutius (1449-1515) in Venice
• Manutius collected Latin and Greek manuscripts from all over Europe and
had them typeset and printed.
But in the past 10 years this kind of lexicography has become
obsolete!
• It has been superseded by a new kind of technology – text processing by
computer. I will discuss this in part 3.
3
The typography of Gutenberg’s
Bible (c. 1455)
4
Nicolas Jenson’s Roman
Antiqua typeface (c. 1468)
5
Palsgrave (1530)
6
R. Estienne (1531)
7
Calepino: Basle edition, 1550
8
Promptorium Parvulorum in print
(Pynson 1499)
9
Present-day technology:
lexicographical evidence
hazard, verb.
1. No one at this stage is prepared to  hazard  a guess at the outcome of the poll on
2. name -- Chicken.” “Not Hen Chicken?” I  hazarded,  as this humorous diminutive was part
3. the wall. Stifling a giggle, she  hazarded  a guess that the wardrobe would be full
4. It seemed sensible to  hazard  that a man of this standing would have
5. can result in lost profits. When staff  hazard  a guess as to the price of goods – or
6. them as Part I and Part 2. One might  hazard  a guess that Part I was concerned with
7. North American standards. He does not  hazard  any opinions on how costs depend on the
8. ecoming proficient. Perhaps we can now  hazard  an attempt at defining `a good reader'.
9. builder, nor an architect, I can only  hazard  a guess. During construction in the mid-19
10. hair and eyes like her mother. I would  hazard  a guess and say she would be, at the time
11. Where do your art materials live? We  hazard  a guess that they're lurking in a shoebox
12. excitement than others, and I would  hazard  a guess that, even if they've never played
13. age and some movies date. I would  hazard  the guess that The Graduate belongs in
14. What the connection is we can only  hazard  a guess at but it confirms all our worst
15. have been lost and commandos were not  hazarded  in foolish risks, although often taking
16. shapes and colours from which we  hazard  the inference that a leaping dog is in
17. and a principle strong enough to  hazard  lives for, America cannot hope to lead
18. of the farmer is not revealed; we may  hazard  the guess that he was William Hardeley,
19. to begin restoring. But I'd  hazard  a guess that if you restore the directory
20. from time to time admire people who  hazard  their entire company on one major throw
21. supreme grade of evil'. It may be  hazarded  that it was this inevitable alliance with
22. his achievement, such as it was, and  hazarded  the opinion that he might best be remembered
23. in those stations' heyday, but I  hazard  a guess that considerably more passengers
24. the day's racing. In fact I would  hazard  a guess that one, if not both of these
25. of society itself. Indeed, one could  hazard  a further, and more general, observation
26. The Phillips curve. Although Phillips  hazarded  some theoretical conjectures concerning
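As a rough illustration of how keyword-in-context lines like those for hazard are produced, here is a minimal Python sketch. The function name kwic, the window width, and the sample text (adapted from concordance lines 1 and 3) are assumptions made for illustration; this is not the concordancing software used for the talk.

```python
import re

def kwic(text, forms, width=40):
    """Return (left, node, right) context triples for each occurrence of a word form."""
    wanted = {form.lower() for form in forms}
    lines = []
    for match in re.finditer(r"[A-Za-z']+", text):
        if match.group(0).lower() in wanted:
            # Take a fixed-width character window on either side of the node word.
            left = text[max(0, match.start() - width):match.start()].strip()
            right = text[match.end():match.end() + width].strip()
            lines.append((left, match.group(0), right))
    return lines

# Sample text adapted from two of the concordance lines (illustrative only).
sample = (
    "No one at this stage is prepared to hazard a guess at the outcome of the poll. "
    "Stifling a giggle, she hazarded a guess that the wardrobe would be full."
)

hits = kwic(sample, {"hazard", "hazarded"})
for left, node, right in hits:
    # Right-align the left context so the node words line up in a column.
    print(f"{left:>40}  {node}  {right}")
```

A fixed character window cuts words in half at the edges, which is why real concordance lines often begin with fragments such as "ecoming proficient".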
10
PART 2: Philosophy, linguistics, and
lexicography
• Do words have meaning?
• If not in words, where do meanings reside?
– Nowhere!
– Meanings are ephemeral interpersonal events, not stable objects
with a ‘residence’.
– But then how can anyone know what anyone else means?
– What do philosophers say?
– What do linguists say?
• And what is the nature of linguistic creativity?
11
Do words have meaning?
• Let’s think of a word: blow
• What does blow mean?
12
The meaning potential of a word
• What’s the meaning of blow?
– What the wind does? A disappointment? Something you do with
your fist? With your nose? With a whistle? Spend a lot of money? …
• What’s the meaning of blow up?
– Destroying a building? What you do to a balloon? Lose your
temper? …
– All of these things and more! Words are hopelessly ambiguous.
– A checklist of word meanings cannot, for principled reasons, be
exhaustive.
– But put a word in context, and ambiguity is reduced or eliminated.
13
Meaning potentials
• If words don’t have meanings, how come dictionaries have
been so successful?
• Strictly speaking, dictionaries list meaning potentials, not
meanings.
– The distinction is subtle but the theoretical consequences
are far-reaching
– When consulting a dictionary, human beings use their
imaginations to put words in a relevant context – a context
for which they are already primed (Hoey 2005)
– Computer programs and logic-based theories are not so
primed.
14
Philosophical background
• H. P. Grice (1957, 1975) argued that meanings are not
just in the head – they are events; interactions between
people:
– between speaker (S) and hearer (H);
– (and with displacement in time) between writer and reader
• For this to work, S and H must share a body of linguistic
conventions having the same meanings.
• Grice did not specify what the conventions are.
– He left that task to linguists and lexicographers
– So far, we seem to have let him down rather badly
15
Lexis and grammar
• Are the conventions that underlie conversational cooperation conventions of grammar (syntax)?
– Only partly. Discussed in more detail in Hanks (2012): ‘How
people use words to make meanings’.
• Perhaps the conventions that we rely on for conversational
co-operation are words, with meanings as given in
dictionaries?
– But two decades of research in Word Sense Disambiguation by
computational linguists (using LDOCE and other existing lexical
resources) is now seen as a failure (Ide and Wilks 2005)
– maybe, at least in part, because dictionaries don’t say enough
about phraseology
• Something else is needed.
16
Firth and Sinclair
“We must separate from the mush of general goings-on
those features of repeated events which appear to be
part of a patterned process.” —J. R. Firth (1950)
17
Idiomaticity vs. Open Choice
• “The principle of idiom is that a language user has available to
him or her a large number of semi-preconstructed phrases that
constitute single choices, even though they might appear to be
analysable into segments.”
—Sinclair 1991. Corpus, Concordance, Collocation, p. 110
• “Tending towards open choice is what we can dub the
terminological tendency, which is the tendency for a word to have
a fixed meaning in reference to the world. ... tending towards
idiomaticity is the phraseological tendency, where words tend to
go together and make meanings by their combinations.”
—Sinclair 2004. Trust the Text, p. 29
18
The importance of context
• “More often than not, activation of a particular meaning
depends on the co-occurrence of two or more lexical items” –
Sinclair
– The study of collocations is still in its infancy
– Empirical measurement of word co-occurrences (collocations)
only became possible with very large corpora (i.e. since the
early 1990s)
– Problem with small corpora (Brown, LOB, ICE):
• Impossible to distinguish significant collocations from chance
– We now have very large corpora – billions of words of
texts
• Contemporary corpora, historical corpora, domain corpora, …
– But serious analysis of corpus data has hardly started
– It requires both new tools and revision of received theories
19
Idiom and Open Choice
• The range of collocational norms varies greatly from
word to word
• What do you abandon?
– a car, [NO DET] ship, an old fridge, a plan, a theory, a baby,
a dog ( = as a pet), your wife and children, …
– Very open choice in the direct object slot.
• What do you hazard?
– The direct object slot is idiomatically highly constrained:
• just one word (guess) accounts for over 50% of uses of this verb
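The claim about guess can be checked informally against the 26 concordance lines for hazard shown earlier in the talk. The Python sketch below is only an illustration: the list holds the right-hand contexts from that slide, the helper name share_with_object is hypothetical, and the test (a determiner followed by the noun immediately after the verb) is a crude stand-in for real direct-object identification. A 26-line sample is also far too small to support the corpus-wide figure on its own.

```python
import re

# Right-hand contexts of the 26 concordance lines for "hazard" from the earlier slide.
RIGHT_CONTEXTS = [
    "a guess at the outcome of the poll on",
    ", as this humorous diminutive was part",
    "a guess that the wardrobe would be full",
    "that a man of this standing would have",
    "a guess as to the price of goods – or",
    "a guess that Part I was concerned with",
    "any opinions on how costs depend on the",
    "an attempt at defining `a good reader'.",
    "a guess. During construction in the mid-19",
    "a guess and say she would be, at the time",
    "a guess that they're lurking in a shoebox",
    "a guess that, even if they've never played",
    "the guess that The Graduate belongs in",
    "a guess at but it confirms all our worst",
    "in foolish risks, although often taking",
    "the inference that a leaping dog is in",
    "lives for, America cannot hope to lead",
    "the guess that he was William Hardeley,",
    "a guess that if you restore the directory",
    "their entire company on one major throw",
    "that it was this inevitable alliance with",
    "the opinion that he might best be remembered",
    "a guess that considerably more passengers",
    "a guess that one, if not both of these",
    "a further, and more general, observation",
    "some theoretical conjectures concerning",
]

def share_with_object(contexts, noun):
    """Count contexts that open with a determiner followed directly by the noun."""
    pattern = re.compile(rf"^(?:a|an|the)\s+{re.escape(noun)}\b")
    hits = sum(1 for context in contexts if pattern.match(context))
    return hits, len(contexts)

hits, total = share_with_object(RIGHT_CONTEXTS, "guess")
print(f"{hits}/{total} = {hits / total:.0%}")  # prints "14/26 = 54%"
```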
20
Exploiting the norm
• I hazarded various Stuartesque destinations like
Florida, Bali, Crete and Western Turkey.
—Julian Barnes
• Is it normal to hazard destinations or locations?
– No.
• This is an exploitation of a norm.
• We need a theory (and an artefact) that distinguishes the
normal, conventional, idiomatic phraseology of each word
from exploitations of those phraseological norms
21
Extended context
(Several exploitations here)
Stuart needlessly scraped a fetid plastic comb over his cranium.
—‘Where are you going? You know, just in case I need to get in touch.’
—‘State secret. Even Gillie doesn’t know. Just told her to take light
clothes.’
He was still smirking, so I presumed that some juvenile guessing game
was required of me. I hazarded various Stuartesque destinations like
Florida, Bali, Crete and Western Turkey, each of which was greeted
by a smug nod of negativity. I essayed all the Disneylands of the
world and a selection of tarmacked spice islands; I patronised him
with Marbella, applauded him with Zanzibar, tried aiming straight
with Santorini. I got nowhere.
22
PART 3: Lexicography of the future
• Will draw on prototype theory (Rosch 1972)
• Will aim to map cognitive prototypes (meanings, beliefs, etc.,
associated with each word) onto phraseological prototypes of
those words in use
• There will be an emphasis on analysing statistically significant
collocations
23
James Murray (1878) predicts the
need for corpus data
• “The editor and his assistants have to spend precious hours
searching for examples of common everyday words. Thus, in
the slips, we have 50 citations for abusion, but for abuse, not
five.” – James Murray, Presidential address to the Philological
Society, 1878
24
The need for a pattern dictionary
• To record all and only the normal patterns of use for each word
– Not meanings
– Not all possible patterns
• A pattern dictionary will be a benchmark against which actual
usage can be measured
• Meanings, implicatures, translations, and whatever-else-you-like are attached to patterns (not to isolated words)
– A word is no more than an entry point to a set of patterns
25
What is a pattern dictionary?
• A semantically driven syntagmatic inventory of normal word
uses and meanings (implicatures).
– Based on analysis of significant colligations and a statistically valid
random sample.
– Shows comparative frequency of each pattern of a polysemous word.
• Meanings are associated with patterns, not with words.
– The colligational preferences of a word are part of its patterns.
• Created by means of a painstaking technique called Corpus
Pattern Analysis (CPA).
26
Norms and exploitations
• A pattern dictionary aims to record all and only the normal
uses of each word.
– Exploitation of norms is a subject for separate analysis.
– Types of ‘exploitation’ include creative metaphor, ellipsis,
and (in particular) anomalous realizations. Consider:
• The goat ate the newspaper.
• The verb eat has a preference for nouns of semantic type [[Food]]
in the direct object clause role.
• ‘[[Animate]] eat [[Document]]’ is not a normal pattern of English.
• Compare John devoured the newspaper.
• ‘[[Human]] devour [[Document]]’ is a normal pattern of English. It
is a conventional metaphor.
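The contrast between the two newspaper examples can be sketched as a lookup against recorded norms. Everything in this Python fragment apart from the two patterns named on the slide is an assumption for illustration: the tiny type hierarchy, the noun lexicon, and the function names are hypothetical and much simpler than a real pattern dictionary.

```python
# Hypothetical fragment of a shallow ontology: each semantic type points to its parent.
PARENT = {
    "Human": "Animate",
    "Animal": "Animate",
    "Animate": "Entity",
    "Food": "Entity",
    "Document": "Entity",
}

# Hypothetical noun lexicon assigning one semantic type per noun.
NOUN_TYPE = {"goat": "Animal", "John": "Human", "newspaper": "Document", "apple": "Food"}

# Norms as (subject type, direct object type) pairs, following the slide's examples.
NORMS = {
    "eat": [("Animate", "Food")],
    "devour": [("Human", "Document")],
}

def is_a(sem_type, ancestor):
    """True if sem_type equals ancestor or inherits from it in PARENT."""
    while sem_type is not None:
        if sem_type == ancestor:
            return True
        sem_type = PARENT.get(sem_type)
    return False

def classify(subject, verb, obj):
    """Label a clause as matching a recorded norm or as a candidate exploitation."""
    subject_type, object_type = NOUN_TYPE[subject], NOUN_TYPE[obj]
    for norm_subject, norm_object in NORMS.get(verb, []):
        if is_a(subject_type, norm_subject) and is_a(object_type, norm_object):
            return "norm"
    return "possible exploitation"

print(classify("goat", "eat", "newspaper"))    # possible exploitation
print(classify("John", "devour", "newspaper")) # norm
```

A clause that fails every norm is only a candidate exploitation; deciding whether it is a creative use or simply an error still needs human analysis.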
27
Specifically, ...
The Pattern Dictionary of English Verbs
• aims to list all normal patterns of each verb lemma in BNC.
– with practical applications and theoretical consequences (see later).
• A benchmark for comparative studies of and identification of
norms in other corpora
– by time period: historical corpora, future corpora
– by region: e.g. American English
– by domain, e.g.
• ‘[[Human]] abate [[Problem = Nuisance]]’ is a domain-specific
norm in the domain of legal jargon
• abate is not normally a transitive verb.
28
A typical Pattern Dictionary entry
• irritate
PATTERN 1 (90%): [[Anything]] irritate [[Human]]
IMPLICATURE: [[Anything]] causes [[Human]] to feel mildly annoyed.
PATTERN 2 (8%): [[Phys Obj | Stuff]] irritate [[Body Part]]
IMPLICATURE: [[Phys Obj | Stuff]] causes [[Body Part]] to become inflamed
and somewhat painful.
• Notes:
1. Both these patterns are transitive but they have different meanings.
They are distinguished by the semantic types of the nouns
2. Getting the right level of semantic generalization for each noun is hard.
It must select normal, prototypical uses – not all possible uses.
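One way to picture such an entry as data is sketched below in Python. The class and field names are assumptions made for illustration and do not reflect the actual PDEV database schema; only the pattern numbers, percentages, semantic types, and implicatures are taken from the entry above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Pattern:
    number: int
    share: float            # proportion of sampled uses, e.g. 0.90 for 90%
    subject: tuple          # alternative semantic types for the subject
    verb: str
    direct_object: tuple    # alternative semantic types for the direct object
    implicature: str

IRRITATE = [
    Pattern(1, 0.90, ("Anything",), "irritate", ("Human",),
            "[[Anything]] causes [[Human]] to feel mildly annoyed."),
    Pattern(2, 0.08, ("Phys Obj", "Stuff"), "irritate", ("Body Part",),
            "[[Phys Obj | Stuff]] causes [[Body Part]] to become inflamed "
            "and somewhat painful."),
]

def render(pattern):
    """Format a pattern in the slide's notation."""
    subject = " | ".join(pattern.subject)
    obj = " | ".join(pattern.direct_object)
    return f"PATTERN {pattern.number} ({pattern.share:.0%}): [[{subject}]] {pattern.verb} [[{obj}]]"

def patterns_for_object(entry, object_type):
    """Select the patterns whose direct object slot accepts the given semantic type."""
    return [p for p in entry if object_type in p.direct_object]

for pattern in IRRITATE:
    print(render(pattern))
print(patterns_for_object(IRRITATE, "Body Part")[0].implicature)
```

Because both patterns are transitive, the lookup has to go through the semantic type of the object, which is the point of note 1.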
29
Semantic type vs. contextual
role
• Mr Woods sentenced Bailey to seven years | life
imprisonment
PATTERN: [[Human 1]] sentence [[Human 2]] {to [[Time Period
| Punishment]]}
• Semantic type: [[Human]]
• Contextual roles: [[Human 1 = Judge]], [[Human 2 = Convicted
Criminal]], seven years [[Time Period = Punishment in jail]]
– Semantic type is an intrinsic semantic property of a lexical
item.
– Contextual role is extrinsic; the meaning is imposed
(activated, selected) by the context in which the word is
used.
30
Nouns and verbs
• The analytical apparatus required for nouns is different in kind
from that required for predicators (verbs, adjectives).
– Nouns are grouped into lexical sets in relation to the predicators that
they normally collocate with.
– The lexical sets are normally united by a semantic type.
– A shallow ontology of nouns (grouped by their semantic type) is
therefore part of the apparatus of a pattern dictionary.
– Semantic types in real texts are more complex than might be expected
at first sight or from invented examples.
– Lexical sets include alternations, parts, and properties of types
31
What would an empirically
well-founded ontology be like? (1)
• A hierarchy of about 250 semantic types (not more)
• Representing the intrinsic conceptual semantic properties of words
– [[Eventuality]] and [[Entity]] at the top
– [[Eventuality]] = [[Event | State of Affairs]]
– [[Entity]] = [[Physical Object | Abstract Object]]
• Each semantic type is governed by corpus evidence of colligations, e.g.:
• [[Human]]s and [[Animal]]s eat, run, sleep, etc.
• [[Human]]s and [[Institution]]s think, say, negotiate, etc.
• So snakes (for this purpose) are not animals
• The hierarchy of [[Artefact]]s has many members, because different
artefacts are used for different purposes (= with different verbs).
• Ref. James Pustejovsky, 1995. The Generative Lexicon.
32
What would an empirically
well-founded ontology be like? (2)
• It would have to take account of verb-specific lexical
alternations (parts and properties).
• For example, Pattern 2 (of 8) for calm, verb, is:
[[Human 1 | Event]] calm [[Human 2]]
– Alternation of Human (2): [[Animal]]
– Parts of Human (2): nerves [[Body Part | Psyche Part]]
– Attributes: [[Possessive Determiner]] fear, anxiety, agitation, ....
[[Emotion]]
33
Argument alternation and focus
1. Straightforward alternations:
– People negotiate, governments negotiate, …
– Humans eat, horses eat, dogs eat, alligators eat …
– Horses gallop, humans gallop [ambiguous]
2. Another function of argument variation is focus:
– repair one’s car, repair the fender, repair the damage
– treat a person, treat his ankle, treat the injured, treat their
injuries
– The meaning of treat here contrasts with the meaning in
treat a person well/badly
The presence or absence of a manner adverbial is all-important
34
How to Measure Collocations?
Various statistical tools are available, e.g.:
• Mutual Information (“MI”; Church and Hanks 1990)
– tends to favour content words as collocates
• t-score tends to favour function words as collocates.
• Sketch Engine (Kilgarriff, Rychlý, et al., 2004)
– measures salience scores for pairs of collocates in pre-determined
colligational patterns
• Take your pick – but measuring must be done, one way or the
other, if we are to have any hope of understanding the nature
of meaning in language and getting our dictionaries to report
accurately how words are used
– because a natural language is a fuzzy, variable, analogical, unstable
system for making meanings
35
The Pattern Dictionary and FrameNet
PDEV is corpus-driven (ruthlessly empirical) and proceeds word by word,
investigating syntagmatic criteria for distinguishing different meanings of
polysemous words, in a “semantically shallow” way.
FrameNet proceeds frame by frame. It:
• expresses the deep semantics of situations (frames);
• proceeds frame by frame, not word by word;
• analyses situations in terms of frame elements;
• studies meaning differences and similarities between different words in a
frame;
• does not explicitly study meaning differences of polysemous words;
• does not analyse corpus data systematically, but goes fishing in corpora for
examples in support of hypotheses;
• has problems grouping words into frames, and misses some;
• has no established inventory of frames;
• has no criteria for completeness of a lexical entry.
36
Construction Grammar (1)
• Focus on meaning, not just well-formedness.
• Challenges reductionist theories of language
• Meaning is (in part) associated with constructions.
• Anything from a word to a clause can be a construction.
– Example: ‘she slept her way to the top.’
– Sleep is not normally a goal-achievement verb.
– But in this sentence, it is coerced into being one by the construction
“[V] one’s way to [[Status]]”.
– This meaning is not arrived at by a concatenation of the meanings of
the lexical items of which the sentence is composed.
37
Construction Grammar (2)
• So far so good – but Construction Grammar is in the
speculative tradition. It is not based on analysis of evidence.
• It is based largely on made-up examples, many of which are
bizarre, e.g. The gardener watered the flowers flat.
• Corpus evidence shows that the verb water does not normally
participate in the resultative construction.
• A distinction between normal usage and exploitation of norms
must be made.
– Abnormal examples are conducive to distortions in the theory.
– CG needs corpus analysis.
– Some sort of synthesis between PG and CG is desirable.
38
Theoretical consequences and
practical applications (1)
Pedagogical:
• Anyone acquiring a language must learn competence in two
kinds of rule-governed linguistic behaviour:
– How to use words normally
– How to exploit the norms (creative metaphors, ellipsis, etc.)
• A pattern dictionary gives comparative frequency of patterns.
– A lexical syllabus will focus on statistically significant patterns of use.
• In error analysis: what norm was aimed at?
– If learners are exploiting norms creatively, do you (the teacher) really
want them to?
39
Theoretical consequences and
practical applications (2)
For theoretical linguistics:
• Are some grammars better than others for representing how
words are used to make meanings?
‘S → NP VP’: confuses language with predicate logic
• The third argument (‘adjunct’, ‘adverbial’):
– Not well analysed in generative grammar (or, indeed, any other
grammar)
– CPA shows that a new grammar of adverbials is needed.
• Metaphor analysis:
– CPA distinguishes conventional metaphors from exploitations.
40
Theoretical consequences and
practical applications (3)
• For computational linguistics and AI:
• Improving machine translation
– Getting the right pattern is more likely to select the right translation.
• Parsing and word-class tagging:
– CLAWS achieves ~90% accuracy in word-class tagging in BNC
– CPA reveals some systematic errors in CLAWS tagging.
• Anaphora resolution:
– He found a glass of water on the table and drank it.
– ‘[[Animate]] drink [[Liquid]]’ selects water as a direct object of drink
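The anaphora example can be sketched as a filter over candidate antecedents. In this Python fragment the semantic types assigned to glass, water, and table, and the function name, are assumptions for illustration; only the object preference of drink comes from the pattern on the slide.

```python
# Hypothetical semantic types for the candidate antecedents in the example sentence.
NOUN_TYPE = {"glass": "Container", "water": "Liquid", "table": "Furniture"}

# Direct object preference taken from the pattern '[[Animate]] drink [[Liquid]]'.
OBJECT_PREFERENCE = {"drink": "Liquid"}

def resolve_pronoun(verb, candidates):
    """Keep the candidates whose semantic type matches the verb's object preference."""
    wanted = OBJECT_PREFERENCE[verb]
    return [noun for noun in candidates if NOUN_TYPE.get(noun) == wanted]

# "He found a glass of water on the table and drank it."
print(resolve_pronoun("drink", ["glass", "water", "table"]))  # ['water']
```

If the filter returned several candidates or none, other cues such as recency or syntax would still be needed.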
41
Presenting the facts to the public (2)
• Dictionaries of the future will be electronic products
– Space constraints removed
– leading to a danger of verbosity
• They will pay more attention to phraseology and collocation
• Language communities will still need lexicographers to
analyse the lexical content of corpora, Internet data,
conversation, etc., and to identify the phraseological
conventions on which successful communication depends
• You can’t just plonk language learners down in front of a
concordance (corpus data) and expect them to work out what
is going on. The data needs an interpretation.
42
Phraseological Lexicography and
Computational Linguistics
• At present NLP applications such as machine translation are
having great success with “knowledge-poor” statistical
methods.
– Sooner or later the pendulum will swing back: lexicographical methods
will be needed to augment the raw statistical approach
• According to Ken Church, in 1987 the single most productive
contribution to the NLP text-to-speech generation system at AT&T
Bell Labs came from the IPA transcriptions in Collins dictionaries
• Can we expect a similar contribution from phraseological
lexicography to computational message understanding?
43
Phraseological Lexicography and
the Semantic Web
• Semantic Web: the original dream:
– “Web technology must not discriminate between the scribbled draft and
the polished performance.” –Berners-Lee, Hendler, and Lassila, in
Scientific American 2001
• At present Semantic Web research is very far from being able
to interpret polished performances, let alone scribbled drafts
– It confines itself to identifying names, dates, addresses, and
appointments, and to processing tags that have been added to elements
in text.
– It is “the apotheosis of annotation – but what are its semantics?” (asks
Yorick Wilks)
• Realizing the dream will require lexicographic input – a radical new
kind of lexicography, one possibility for which I have tried to
outline in this presentation.
44
A model presentation
• OED3 is a model of electronic presentation
– but its lexicographical principles are old: they are (rightly)
those of the Renaissance and the Enlightenment
– These principles need revision in the light of corpus evidence
– But you interfere with a national monument at your peril
– One of many unacknowledged theoretical problems is a
confusion between the (stipulative) meaning of scientific
concepts and the meanings of words in natural language.
• Dictionaries of the future will be based on the principles of
Wittgenstein, Rosch, Putnam, Grice, Firth, and Sinclair.
45
Can (should) natural language be
regulated? (1)
• Johnson’s dictionary (1755) was based on citations from “the
best authorities”.
• “Those who have been persuaded to think well of my design
require that it should fix our language...
• “When we see men grow old and die ... we laugh at the elixir
that promises to prolong life to a thousand years; and with
equal justice may the lexicographer be derided who, being able
to produce no example of a nation that has preserved their
words and phrases from mutability, shall imagine that his
dictionary can embalm his language and secure it from
corruption and decay.” —Preface, Dictionary, 1755
46
Can (should) natural language be
regulated? (2)
• Johnson’s liberal empirical descriptivism is OK for English
• But what about other language situations, e.g.
– Norwegian (institutionalized diglossia)
– Czech (Every literate user of Czech must be able to use standard
literary Czech, as well as his or her local dialect – but standard
literary Czech is not a natural language)
– Greek? (katharevousa is obsolescent)
– Languages without a strong literary convention, e.g. Bantu languages,
such as Northern Sotho, Zulu, Luganda. An element of prescriptivism
seems to be inevitable here.
– What about French? What is the role of the Académie Française in this
brave new world of computational language processing?
• These are subjects on which I am not qualified to speak.
47
Thanks
• To you, for listening,
• To the late John Sinclair and the (still extant) James
Pustejovsky, who have inspired this approach,
• To the Academy of Sciences of the Czech Republic (project
T100300419) and the Czech Ministry of Education (National
Research Program II project 2C06009), who, in part, funded
the pilot study on which PDEV is based,
• And to Karel Pala, Pavel Rychlý, Adam Rambousek, and
Adam Kilgarriff, who have created tools that make this kind of
analysis possible
48
Invitation to browse the Pattern
Dictionary
• Fire up a Firefox browser window.
• VISIT: http://nlp.fi.muni.cz/projects/cpa
• Pattern Dictionary of English Verbs: http://deb.fi.muni.cz/pdev/
49