Lecture slides: Morphology and Morphological Processing

Morphology,
Morphological Processes and
Morphological Processing
John Barnden
School of Computer Science
University of Birmingham
Natural Language Processing 1
2015/16 Semester 2
Overview
• Morphology is to do with the “shape” (internal structure) of words
and how the shape changes to reflect certain common, fairly
systematic changes of meaning. E.g.:
– forming plurals of many nouns by adding an “s”; but also irregular plurals (e.g.
“goose” to “geese”);
– forming the past tense of many verbs by adding “ed”;
– going from “buy” to “buyer”;
– going from “happy” to “unhappy”;
– forming “doghouse”.
• Many such changes (not all) involve adjustments of substructure.
• Such substructure is a matter of decomposition into meaning-bearing/affecting
subunits (as opposed, e.g., to individually meaningless letters or letter-strings).
Overview, contd
• Morphological processes are the ways, alluded to on the
previous slide, in which words can, so to speak, change [more
exactly: certain ways in which words are related to each other].
• Morphological processing is about how to computationally
convert between words according to morphological processes,
how to analyse words into their components if any, and how to
create words from such components.
– Lectures will cover a start on this, with further detail left to the textbook.
– The basic tools are regular expressions or (equivalently) finite state
automata.
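As a minimal sketch of the equivalence just mentioned, the Python fragment below recognises the same tiny set of strings twice: once with a regular expression and once with a hand-written finite state automaton. The two stems, the state names and the function names are illustrative assumptions, not taken from the lectures or the textbook.

```python
import re

# Assumed toy pattern: one of two noun stems, optionally followed by plural "s".
NOUN_RE = re.compile(r"(cat|girl)s?")

# The same language as an explicit automaton: (state, letter) -> next state.
# State names are hypothetical labels chosen for readability.
TRANSITIONS = {
    ("start", "c"): "c", ("c", "a"): "ca", ("ca", "t"): "stem",
    ("start", "g"): "g", ("g", "i"): "gi", ("gi", "r"): "gir",
    ("gir", "l"): "stem",
    ("stem", "s"): "plural",
}
ACCEPTING = {"stem", "plural"}


def regex_accepts(word):
    """True if the whole word matches the regular expression."""
    return NOUN_RE.fullmatch(word) is not None


def fsa_accepts(word):
    """True if the automaton is in an accepting state after reading the word."""
    state = "start"
    for letter in word:
        state = TRANSITIONS.get((state, letter))
        if state is None:  # no arc for this letter: reject
            return False
    return state in ACCEPTING
```

Both functions should agree on every input string; the automaton simply spells out, arc by arc, what the pattern states compactly.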
Morphemes
• Morphemes are the components of words that we will be
considering.
• They are variously described as (for any particular language):
• the minimal units of grammatical analysis
• the minimal units of meaning
• the minimal units that bear meaning [J&M p.81]
• Possibly better :
– the minimal units that bear or affect meaning.
Examples of Morphemes
• Consider the word “unhappiness”:
composed of three morphemes, each carrying a certain amount of meaning:
– “un”
here means opposite of [or not in many other cases]
– “ness”
means being in a state or condition
– “happy”: the familiar word (slightly modified by being combined on the right).
• One classification of morphemes:
– “happy” is a free morpheme because it can appear on its own and still mean the same
as in the word above.
– “un” and “ness” are bound morphemes as they have to be attached to a free
morpheme – they can’t mean what they do above when standing on their own.
• But:
– There is a completely unrelated word “ness”.
– There is a rather informal word “un” derived from the “un” morpheme, meaning
something like “unimportant, characterless, ineffectual, ...”
– “happy” can act as part of a bigger word in other ways, as in “trigger-happy”.
Affixes
• Affixes are an important type of (usually bound) morpheme,
usually small, and making a largely predictable meaning change
that’s largely independent of what they are applied to.
• In “unhappiness”, there are two kinds of affix:
– a prefix, “un”
– a suffix, “ness”.
• There are also infixes in some languages (and perhaps, in special
cases, in English).
– Bontoc [from the Philippines] uses the infix “um” to change adjectives and
nouns into verbs. So the word “fikas”, which means strong, is transformed
into “fumikas”, meaning be strong.
– English: “unhappy” to “unbloodyhappy” [slang] – and similarly with
some other favourite swear words.
• See textbook for circumfixes.
Affixes, contd
• Other examples of prefixes:
– “re”: conveys repetition or renewal, as in “redevelop” or “retile” or “reexamine”
– “mis”: conveys doing something wrongly, as in “misremember”
– “de” and “dis”: convey removal, undoing or reversal, as in “depopulate”,
“disembowel”, “disappear”
– “in” [or “im” before some consonants]: often conveys negation, as in
“indecisive”, “immeasurable”, “imperfect”
– “in” [or “im”]: can also convey being/putting into a state, as in “invigorate”,
“inflammable”
– “anti”: indicates opposition, as in “antisemitic”
– “ante”: indicates beforeness [spatial or temporal], as in “antenatal”
– “pre”: indicates beforeness [spatial or temporal], as in “prefix” !
Affixes, contd
• Other examples of suffixes:
– “ing”: changes a verb infinitive into progressive form, as in “buying”
– “s” [changed to “es” after some consonants]: makes a noun plural or
changes a verb infinitive into 3rd-person singular
– “ed” [or “d” or “t”]: changes a verb infinitive into past tense or past
participle form
– “ity”: makes a noun out of an adjective, as in “activity”, “purity”
– “less”: indicates lack of something [could perhaps be considered a free
morpheme because so close in meaning to the word “less”]
– “ish”: indicates likeness, closeness or somewhatness, as in “bluish” and
“city-ish”
– I invented “somewhatness” by freeish application of morphology!
Cautions re Affixes, etc.
• Letters at beginning or end of a word can of course look like, but not be, a
particular affix. Two examples:
– “re” is sometimes not the abovementioned prefix, as in “regal”, “ready”, “region”
– “ly” is sometimes not the adverb-creating suffix, as in “holy”, “lily”, “hilly”
• A word need not contain any free morpheme:
– E.g., “inhere”, “cohere” and “adhere” are all formed from a commonly prefixed
bound morpheme plus the morpheme “here” – the latter means to stick (from a
Latin verb) but is not itself usable as a word of English with such a meaning.
• Affixes can be concatenated (strung together) to some extent:
– E.g.: “morphologically”, “antidisestablishmentarianism”, “moralizing”
– Some languages, e.g. Turkish, allow concatenation much more extensively (see
textbook).
Cautions re Affixes, etc., contd.
• Affixes can adjust the meaning of what they’re affixed to in
somewhat subtler ways than you might expect:
– E.g., “entomb” comes from “en” [meaning put in] and “tomb”, but it usually has a
broad metaphorical meaning, and less usually means put in a [literal] tomb.
– Such formation of verbs from nouns often uses metaphorical meanings of the
main morpheme rather than literal meanings.
Word Stems
• A word often has one intuitively-main morpheme:
– “unhappiness”: main morpheme is “happy”
• The main morpheme is called the “stem” of the word, and may be the whole
word.
• We’ll see below that a word can contain more than one free morpheme, and
in such cases the idea of a stem may be more difficult.
Types of Morphological Process
• It’s convenient to divide morphological processes into four rough
types:
– Inflection
– Derivation
– Cliticization
– Compounding
• It’s difficult to devise a precise definition of these types, even
within a single language.
• And there’s some overlap.
Inflection
• Inflection is a morphological process that varies a word
– in certain very limited, standard, predictable ways,
typically via affixes,
– keeping some large part of meaning intact,
but changing the values of certain standard parameters.
• The variations are usually tightly related to the grammatical structure of the
surrounding expression.
• Examples in English on next slide ...
Inflection: Examples
• Nouns:
– Pluralizing a singular noun (a basic example: “cat” to “cats”)
– Forming possessive forms of a noun (a basic example: “cat’s ” and “cats’ ”)
• Pronouns and related adjectives:
– Setting the case/number (e.g., in varying between nominative “I/we/who”, accusative
“me/us/whom”, possessive forms “my/mine/our/ours/whose”) . Also demonstrative
pronouns and adjectives: “this/these”, “that/those”.
• Verbs:
– Setting the person, number, tense, etc. (infinitive “eat” to “eats/ate/eaten”; “be” to
“am/is/are/was/were”)
– Forming the present participle by adding “ing”, used in progressive constructions (“I
am / was / will be buying”) and as a gerund (a form of noun, as in “the cutting of the
cake”).
• Adjectives and adverbs:
– Forming the comparative and superlative forms (e.g., “big” to “bigger” and “biggest”;
“fast” [as adverb] to “faster” and “fastest”; and in colloquial English “quickly” to
“quicker” and “quickest”).
Inflection, contd.
• Other languages may do things not done, or not done much, in English, e.g.:
– inflect nouns for case (nominative, accusative, etc.)
– inflect a definite article for case and number (whereas in English it’s always just
“the”).
• Conversely, other languages do not do some things English does (e.g.,
Japanese nouns don’t have plural or possessive forms).
[Inflection, contd.]
• Inflection often involves certain systematic spelling changes to the stem, e.g.
– Final “c” becomes “ck” as in “picnic” to “picnicking”
– Dropping of a single, not separately pronounced “e” when adding “ing” (but don’t
drop it when the word ends in “ee”)
– Doubling of final consonant when adding a suffix starting with “e” or “i” (as in
“beg” to “begging”, and “big” to “bigger”).
• Inflection includes cases where meaning variations of the sorts on previous
slide are reflected by irregular word forms (consider irregular verbs such as
“be”, irregular plurals such as “geese” and “mice”).
So inflection is not just about systematic lexical changes. (The textbook is
slightly inconsistent on this.)
• Inflection includes the case where the word form is actually unchanged (e.g.
“hit” : infinitive and past-tense form and past participle).
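The systematic spelling changes listed on this slide can be sketched as a small rule cascade for adding “ing”. This is only a rough approximation under stated assumptions: in particular, the doubling rule is restricted to words with a single vowel letter as a crude stand-in for the real stress-based condition, and the function name is hypothetical.

```python
VOWELS = set("aeiou")


def add_ing(verb):
    """Attach "ing" to a verb infinitive using three rough spelling rules."""
    # Rule 1: final "c" becomes "ck" ("picnic" -> "picnicking").
    if verb.endswith("c"):
        return verb + "king"
    # Rule 2: drop a single final "e", but keep "ee" ("see" -> "seeing").
    # Assumption: words ending in "ie", "oe" or "ye", and two-letter words
    # such as "be", are left untouched; some of these need further rules
    # ("die" -> "dying") that this sketch does not cover.
    if verb.endswith("e") and len(verb) > 2 and verb[-2] not in "eioy":
        return verb[:-1] + "ing"
    # Rule 3: double the final consonant of a consonant-vowel-consonant ending.
    # Assumption: only applied when the word has exactly one vowel letter.
    if (len(verb) >= 3
            and verb[-1] not in VOWELS and verb[-1] not in "wxy"
            and verb[-2] in VOWELS
            and verb[-3] not in VOWELS
            and sum(ch in VOWELS for ch in verb) == 1):
        return verb + verb[-1] + "ing"
    # Default: plain concatenation ("trust" -> "trusting").
    return verb + "ing"
```

Longer verbs whose final syllable is stressed (such as “begin”) are deliberately not handled; covering them would need pronunciation information rather than spelling alone.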
Towards Morphological Processing
• NLP systems for English often don't include any or much morphological
processing, especially if they are small-scale systems or systems with
specialized purposes such as information retrieval.
– Just list all the different word forms separately.
– Or may just do “stemming” (finding the stem of each word), a limited sort of
morphological processing
• Inflectional morphology
– For other languages, e.g. French and German, NLP systems often include inflection
analysers.
– When inflectional analysis is done:
▪ A standard technique is Finite State Automata – see below and textbook.
▪ FSAs are powerful and economical for the regular cases, and exceptions (the
irregular words) are just included as extra regions in the network of states.
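The division of labour between regular cases and listed exceptions can be illustrated without a full automaton. The sketch below is a plain lookup-plus-rules analyser for noun forms, not an FSA: it consults a small exception table first and only then applies regular suffix rules. The word lists and the function name are assumptions made for illustration.

```python
# Assumed toy exception list: irregular plural -> singular stem.
IRREGULAR_PLURALS = {"geese": "goose", "mice": "mouse", "children": "child"}

# Stem endings after which the regular plural is spelled "es" rather than "s".
ES_CONTEXTS = ("ss", "x", "z", "ch", "sh")


def analyse_noun(form):
    """Return (stem, number) for a noun form, trying exceptions first."""
    # Exceptions take priority over every regular rule.
    if form in IRREGULAR_PLURALS:
        return IRREGULAR_PLURALS[form], "plural"
    # Regular "es" plural ("boxes" -> "box", "glasses" -> "glass").
    if form.endswith("es") and form[:-2].endswith(ES_CONTEXTS):
        return form[:-2], "plural"
    # Regular "s" plural ("cats" -> "cat"); "ss" endings are not plurals.
    if form.endswith("s") and not form.endswith("ss"):
        return form[:-1], "plural"
    return form, "singular"
```

Because there is no lexicon of stems, a singular noun such as “bus” would be misanalysed as a plural; a realistic analyser would check candidate stems against a word list.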
A Small Morphological Analyser
[courtesy of Dr Peter Hancox]
• Designed for a tiny fragment of English, and treats even that fragment
incompletely.
• Covers just the determiner “the”, the nouns “girl”, “girls”, “cat”, “cats”
and the verbs “trust”, “trusts”, “trusting”, “trusted”.
• Produces two types of information:
– Syntactic category: noun, verb, determiner.
– Grammatical features:
▪ Number: i.e. singular, plural
▪ Person: i.e. first, second, third
▪ Tense: i.e. past, present
▪ Participle: i.e. yes, no.
• Takes the form of an FSN (Finite State Network/Automaton/Machine) ...
Small Morphological Analyser: the FSN
• SEE DIAGRAM IN SEPARATE FILE available via the module slides page:
http://www.cs.bham.ac.uk/~jab/Modules/NLP1/14-15/Lectures/morphology.FSNdgm.hancox.pdf
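Since the diagram itself is in a separate file, here is a hedged Python sketch of one possible network for the same fragment. The state numbers and arc layout are assumptions and do not reproduce the diagram; arcs labelled with a whole string abbreviate a chain of single-letter arcs. The feature names follow the list of grammatical features given earlier, while the exact feature values attached to each ending are illustrative choices.

```python
# Each arc: (from_state, to_state, letters consumed, information contributed).
ARCS = [
    (0, 1, "the", {"category": "determiner", "stem": "the"}),
    (0, 2, "girl", {"category": "noun", "stem": "girl"}),
    (0, 2, "cat", {"category": "noun", "stem": "cat"}),
    (0, 3, "trust", {"category": "verb", "stem": "trust"}),
    (1, "end", "", {}),
    (2, "end", "", {"number": "singular"}),
    (2, "end", "s", {"number": "plural"}),
    (3, "end", "", {"tense": "present", "participle": "no"}),
    (3, "end", "s", {"number": "singular", "person": "third",
                     "tense": "present", "participle": "no"}),
    (3, "end", "ing", {"tense": "present", "participle": "yes"}),
    (3, "end", "ed", {"tense": "past", "participle": "no"}),
    (3, "end", "ed", {"tense": "past", "participle": "yes"}),
]


def analyse(word, state=0, info=None):
    """Return every analysis (a dict) reachable by consuming the whole word."""
    info = info or {}
    if state == "end":
        # Only succeed if all the letters have been used up.
        return [info] if word == "" else []
    results = []
    for source, target, label, contribution in ARCS:
        if source == state and word.startswith(label):
            merged = {**info, **contribution}
            results += analyse(word[len(label):], target, merged)
    return results
```

For example, analyse("trusted") yields two analyses (past tense and past participle), whereas analyse("dog") yields an empty list because the word is outside the fragment.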
More Detail, and Transduction
• Please read sections 3.1 to 3.7 of J&M.
• The above FSN is merely a recognizer.
• J&M 3.4 etc. goes into transduction, i.e. conversion of forms (e.g. from
singular to plural and back).
• Also involves a somewhat different way of specifying the result of recognition.
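To make the idea of transduction concrete, here is a minimal sketch of a toy transducer that relates a lexical form such as cat+N+PL to a surface form such as cats, and can be run in either direction. The tag names (+N, +SG, +PL) are loosely modelled on common textbook notation, and the state numbers, arc table and function name are assumptions for illustration only.

```python
# Arcs of a toy transducer: (state, next_state, lexical symbol, surface string).
FST_ARCS = [
    (0, 1, "cat", "cat"),
    (0, 1, "girl", "girl"),
    (1, 2, "+N", ""),
    (2, 3, "+SG", ""),
    (2, 3, "+PL", "s"),
]
FINAL = 3


def transduce(text, to_surface, state=0):
    """Map a lexical string to surface strings (to_surface=True) or the reverse."""
    if state == FINAL:
        # Succeed only if the whole input has been consumed.
        return [""] if text == "" else []
    outputs = []
    for source, target, lexical, surface in FST_ARCS:
        # Choose which side of the arc is read and which is written.
        read, write = (lexical, surface) if to_surface else (surface, lexical)
        if source == state and text.startswith(read):
            for rest in transduce(text[len(read):], to_surface, target):
                outputs.append(write + rest)
    return outputs
```

Running it as transduce("cat+N+PL", True) generates the surface form, while transduce("cats", False) recovers the lexical form; the same arc table serves both directions, which is the point of using a transducer rather than a recognizer.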
Derivation
• Derivation is a morphological process that varies a word
– in ways not covered by inflection, and less systematic
but still by means of relatively small changes in form such as adding affixes (or no
change at all),
– usually involving a bigger shift of meaning
that is somewhat unpredictable
• Examples in English ....
Derivation: Examples
• Making adjectives into adverbs by suffixing with “ly”.
• Making nouns (etc.) into adverbs by suffixing with “wards”, as in “sidewards”.
• Nominalizing (= “nounifying”) verbs by suffixing with “ation” or “ment” (as in
“payment”), “ee” (as in “payee”), “er” (as in “payer”).
• Making nouns into verbs without changing the spelling (as in “pencil”,
“book”, “impact”, “carpet”, “bus”, “powerpoint”).
• Verbifying nouns by suffixing with “ify”! Or with “ise/ize”.
• Nominalizing adjectives by suffixing with “ness”, “ity”.
• Making nouns into adjectives by suffixing with “ish”, “y” (as in “frilly”),
“[-]like”, “less”, “[e]d”.
• Making verbs into adjectives by suffixing with “able/ible”.
• Other more ad hoc cases such as in “iffy” (from “if”), use of “big” as a verb (in
“big it up”), ...
Can you think of any other cases?
Unclear Inflection/Derivation Boundary
• Inflection usually doesn’t change the [traditional] POS of the affected word
(e.g. verbs stay as verbs) whereas derivation usually does change it, but there
are exceptions.
– E.g. The textbook includes within inflection the formation of the gerund (i.e.
noun) form of a verb by adding “ing”, even though this changes the POS.
– Adding the affix “dom” (as in “kingdom” and “martyrdom”) makes too big and
unpredictable a difference in meaning to fit with inflection, but doesn’t change
the POS (still a noun).
– Adding “er” to get a noun indicating the doer of something is a derivation process
that can be done not only on verbs (“baker”) but also on some nouns
(“philosophy” to “philosopher”).
– Similarly the suffix “ist” converts between nouns (“art” to “artist”).
[Unclear Boundary contd]
• Example: is adding “ish” an act of inflection or derivation or both?
– It can deliver an adjective from an adjective or a noun, but seems odd to say it’s
inflection in one case and derivation in the other.
– When modifying an adjective, it’s not obviously making a more major meaning
change than forming a comparative or superlative, or going from one tense to
another of a verb.
Compounding
• Compounding is the morphological process whereby two or more words are
combined to form a word, as in
– football, basketball, racquetball
– doghouse
– blackboard
– winklepicker
– bargain-hunter
– postman, yes-man
– catch-all, know-all
– pear-shaped
– eggbeater
– cheeseburger [compound of “cheese” and the abbreviated word “burger”, with
some confusion about “ham”!]
– blue-green, pike-perch [which word is the stem?]
[Compounding, contd.]
• When an affix, such as “able”, is a free morpheme (i.e., also a word, with a
very similar meaning) should we also view the affixing as compounding?
[Compounding, contd]
• Joining words without a hyphen is largely a matter of convention (i.e. implicit
agreement) in English – you can’t just freely form such compounds.
• And in English, compounding with a hyphen is relatively free, and putting
nouns next to each other without joining or hyphenating to form so-called
“noun-noun compounds” gets a similar effect and is extremely free.
– telephone-licker
– telephone licker
– telephone licker defender
– telephone licker defender scandal
• Joined-up compounds, hyphenated compounds and noun-noun compounds
don’t have meanings that are predictable in any simple, uniform way from the
individual words:
– Postman: man who delivers post
– Fireman: man who delivers fire?!
– But note: South of France forest firemen arson scandals
Cliticization: Examples
• Adding “not” in the form “ n’t ” to certain verbs: to be, to have or auxiliaries:
“isn’t”, “mustn’t”, “don’t”, “didn’t”, “haven’t”, “can’t”, “shouldn’t”, etc.
– NB: special cases “can’t”, “won’t”, “shan’t”, “ain’t”.
– Exercise: in what ways are these special?
– NB also: “cannot”.
• Can’t usually do the above when the verb is not to be, to have, or an auxiliary.
Can’t say:
– “I don’t my push-ups any more” to mean “I don’t do my push-ups any more”.
– “I didn’t him in” to mean “I didn’t do him in” i.e. “I did not kill him.”
– “I can’t my tomatoes on Saturdays” to mean “I don’t can my tomatoes on Saturdays”.
Cliticization: Examples, contd
• Adding “is”, “are”, “will”, “would” and “am” to previous word as in
– “It’s in the garden”
– “The cat’s in the garden”
– “They’re in the garden”
– “The horse’ll be in the garden”
– “I’ll be in the garden”
– “I’m in the garden”
– “I’d be in the garden if three cats, five dogs, a horse and a strange professor weren’t
already there.”
• Adding “has”, “have”, “had” as in
– “The cat’s already been in the garden for five hours”,
– “The cats’ve already been in the garden for five hours”,
– “You’ve ten minutes to get that horse out of there”,
– “I thought you’d already got it out”.
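The examples on these two slides show that the same written clitic can stand for different words (“ ’s ” for “is” or “has”, “ ’d ” for “would” or “had”). A minimal sketch of a clitic splitter that returns all candidate expansions, leaving the choice to later processing, might look as follows. The table and function name are assumptions; hosts whose spelling changes when “n’t” is attached are not treated specially.

```python
# Candidate full forms for each enclitic; several clitics are ambiguous.
ENCLITICS = {
    "n't": ["not"],
    "'s": ["is", "has"],      # could also be the possessive marker
    "'re": ["are"],
    "'ll": ["will"],
    "'m": ["am"],
    "'d": ["would", "had"],
    "'ve": ["have"],
}


def split_clitic(token):
    """Return (host, candidate expansions), or (token, []) if no clitic is found."""
    # Normalise the typographic apostrophe to the plain ASCII one.
    word = token.replace("\u2019", "'")
    lower = word.lower()
    for clitic, expansions in ENCLITICS.items():
        # Require some host material in front of the clitic.
        if lower.endswith(clitic) and len(word) > len(clitic):
            return word[:-len(clitic)], expansions
    return word, []
```

Choosing between the candidates (for instance “is” versus “has” in “The cat’s ...”) depends on the following words, so it is left to a later stage such as parsing.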
[Cliticization: Some Special Cases]
• “of” in the form “ o’ ” in “o’clock”,
and in proper nouns: “O’Connell”, “O’Gaunt”.
• “my” in the form “mi” or “ m’ ” in “milord”, “m’lord”, “m’lud”, “m’boy”.
• “to” in the form “a” in some verbs as in colloquial “gonna”, “wanna”.
• “the” in the form “ th’ ” as in “th’morn” in older English esp. poetry.
• “the” in the form “ t’ ” as in “ t’cat ” in some Northern English dialects.
• “it” in the form “ ’t ” as in “ ’twas / ’twere / ’twill / ’twould ”.
• “one” in the form “un” as in “biggun”, “smallun”.
• “and” in the form “ ‘n’ ” as in “pick’n’mix”.
• Remember, AI systems have to deal with dialect, slang, etc. not just proper
Queen’s English (or Kate Middleton’s)!
Cliticization, contd.
• The small added word is called a “clitic”.
– Proclitic if before the other word
– Enclitic if afterwards
• [Clitics in (modern everyday) English are almost all enclitics. You find some
proclitics in isolated forms and in dialect or older versions of English as above.]
• Proclitics are common in other languages, e.g. French:
– adding “le” and “la” to the next word in the form “ l’ ” when it starts with a vowel or
an “h” (usually), as in “l’arbre” and “l’homme”.
• In English, the clitic is almost always separated off by an apostrophe and
abbreviated. This doesn’t necessarily carry over to other languages. (See
textbook.)
• [The notion of clitic is not easy to define clearly – see textbook and dictionaries.
Lack of stress on the clitic in the pronunciation of the word is typically
mentioned, but I’m not convinced that this is a valid criterion.]
Cliticization and Affixes
• The textbook says that a clitic is somewhere between being a
word and being an affix.
• You might ask why clitics aren’t just classified as a particular form
of affix.
One reason: words involving clitics act grammatically like the
phrases they came from rather than like single words (even
when the compounding is obligatory):
– E.g. “You’ve” acts grammatically like the phrase “You have”.
– Although in French we can’t actually use “le arbre”, nevertheless “l’arbre” acts
grammatically as that phrase would have done had it been allowed, and acts like
analogous phrases such as “le chat”.
And a clitic can be added to a whole phrase, as in
– “The man I was speaking of’s been here again”
[Cliticization: Remarks]
• Cliticization is
– [in my view] a special, exceptional form of (joined-up) compounding, where
– the joining up is done for brevity or ease of pronunciation of a phrase rather than
specifically to create a word:
– the resulting word acts like the phrase it would have been had the joining not
been done, or like similar phrases, rather than like a normal single word.
– But despite the mere brevity/ease motive, the cliticization can be obligatory.
• Cliticization is unlike normal compounding in that it can act on a whole
phrase, not just on separate words.
– But so can fairly normal compounding (see South of France example and this
underlined phrase!)
[Apostrophes are Complicated]
• Apostrophes are often used in abbreviations to indicate missing letters, as in
“ ’phone ” [old-fashioned], “B’ham” on road signs and “ mornin’ ” in dialect.
• Apostrophes are often left out in computer-mediated chat, texting (SMS), etc.
– There’s a modern tendency to miss out apostrophes in possessive forms of nouns
in official building signs, street signs, etc. (“Snooker Players Convalescent Home”).
• Many people – including students who should know better – often write “ it’s ”
instead of “ its ”.
– “ It’s ” is the cliticized form of “it is”. “ Its ” is the adjective meaning “of it”.
• Shop owners sometimes wrongly include apostrophes in plurals – “potato’s”.
– There’s a strong but misguided tendency to insert an apostrophe when pluralizing
unusual words such as acronyms, as in “PDF’s”. It’s perfectly fine to write “PDFs”!
• Many people think that the possessive of (e.g.) “James” is “James’ ”, when really
it should be “James’s” normally.
Special Practical Aspects of Morphology
• In informal English and particularly in computer-mediated chat, repetition of
letters is used for emphasis of meaning, as in “baaaad”, “grrrrrand”, “grrrrr”,
“hmmmm”, “oooooh”.
Although the repeated letter is not itself a morpheme, letter repetition could be
said to be a morphological process as it fairly systematically changes meaning.
• Capitalization of all or parts of words for emphasis could perhaps be said to be a
morphological process, though this would probably cause arguments!
• Exercise: when only parts are capitalized, what sort of part do they tend to be?
• The phenomena on this slide (plus repetition of exclamation and question
marks) are very important for the practical processing of internet chat,
etc.
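For the practical processing of chat text, one simple treatment of letter repetition is to shrink long runs of a repeated character while recording that emphasis was present, so that the meaning change is not lost. The sketch below assumes that a run of three or more identical characters signals emphasis; that threshold and the function name are illustrative choices.

```python
import re

# Three or more copies of the same character in a row.
REPEAT_RE = re.compile(r"(.)\1{2,}")


def normalise_repeats(token, keep=1):
    """Shrink runs of 3+ identical characters to `keep` copies.

    Returns the normalised token and a flag saying whether any run was found.
    """
    normalised, count = REPEAT_RE.subn(lambda m: m.group(1) * keep, token)
    return normalised, count > 0
```

With the default keep=1, “baaaad” becomes “bad” with the emphasis flag set, while an ordinary word such as “good” is left alone because its double letter is below the threshold. The same function also shrinks runs of exclamation or question marks.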
[Morphological Processing, contd.]
• Derivational morphology
– Can be used to reduce the number of separate word forms to be stored.
– E.g., given an entry for the base form of the verb sing, then use rules to map the
nouns singer and singers onto the same entry.
• Derivational morphology is particularly useful for Machine
Translation (MT) ...
[Morphological Processing, contd.]
• In either single-language or MT systems, words may actually be
previously-unseen words or actual neologisms (newly invented
words). E.g.:
– Neologisms often have a proper name as their root. A knowledge of how
Thatcherite and Blairism were formed from proper names could, e.g., enable an
MT system to translate them into an idiomatic equivalent in the target language.
[Morphological Processing, contd.]
• Previously-unseen words or neologisms in MT, contd:
– Analyser reduces these words to their base form.
– It may be able to translate the base form
– It may be able then (in effect) to coin a word in the target language by
simply following rules.
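The first step described above, reducing an unseen word to its base form, can be sketched for name-based neologisms as a suffix check against a list of known proper names. The name list, the suffix glosses and the function name are all assumptions made for illustration; translation and coining a target-language word are not attempted here.

```python
# Assumed toy resources: known proper names and suffixes that attach to them.
KNOWN_NAMES = {"Thatcher", "Blair"}
NAME_SUFFIXES = {
    "ite": "supporter or follower of",
    "ism": "doctrine or style associated with",
}


def analyse_neologism(word):
    """Return (name, suffix, gloss) if the word is a known name plus a known suffix."""
    for suffix, gloss in NAME_SUFFIXES.items():
        if word.endswith(suffix):
            base = word[:-len(suffix)]
            # Only accept the split if the remainder is a known proper name.
            if base in KNOWN_NAMES:
                return base, suffix, gloss
    return None
```

Requiring the remainder to be a known name stops ordinary words that merely end in “ism” or “ite” (such as “prism”) from being analysed as neologisms.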
Small Morphological Analyser:
Prolog Implementation of the above FSN
(OPTIONAL MATERIAL)
Small Morphological Analyser:
Prolog Implementation of the FSN
• Syntactic category result for a word is e.g.: noun(cat), verb(trust)
• Features result for a word is of form
features(Number, Person, Tense, Participle, Extra)
– where e.g. Number is the term: number(singular)
– The Extra parameter was included for future expansion.
– Parameters that are inappropriate for a word are left unbound in the result.
• The syntactic category and appropriate features info must be returned every time the
FSN gets to a final state.
• States are identified by integers in the program, with 1 being the initial state and 9999
the final state.
• States from 8000 onwards are used for cases where a regular word stem has been
found, so arcs from them deal with regular endings.
Small Morphological Analyser:
The Controller Code
Here: http://www.cs.bham.ac.uk/~jab/Modules/NLP1/10-11/Tools/
file: fsn1morph.pl
%% - We can use this controller as follows:
%%   | ?- morph(1,Word,Features,[g,i,r,l], []).
%%   | ?- morph(1,Word,Features,[t,r,u,s,t,e,d],[]).
% 1 - terminating condition: an arc leads straight to the final state
morph(State, Word, Features, S, S0) :-
    final_state(Final),
    arc(State, Final, S, S0, Word, Features).
% 2 - recursive condition: follow one arc, then continue from the next state
morph(State, Word, Features, S, S0) :-
    arc(State, Next, S, S1, Word, Features),
    morph(Next, Word, Features, S1, S0).
% placed after both morph/5 clauses so that they stay contiguous
final_state(9999).
Small Morphological Analyser:
Examples of Coding of the FSN’s Arcs
arc(1, 2, [c|S], S, _Word, _Features).
arc(1, 4, [g|S], S, _Word, _Features).
arc(1, 10, [t|S], S, _Word, _Features).
arc(2, 3, [a|S], S, _Word, _Features).
arc(4, 5, [i|S], S, _Word, _Features).
arc(5, 6, [r|S], S, _Word, _Features).
arc(3, 8000, [t|S], S, noun(cat), _Features).
arc(6, 8000, [l|S], S, noun(girl), _Features).
arc(8000, 9999, S, S0, _Word,
    features(number(singular),person(_),_Ten,_Part, _)) :-
    punctuation(S, S0).
arc(11, 9999, [e|S], S0, det(the),
    features(_Numb, person(third),_Ten,_Part,_)) :-
    punctuation(S, S0).
Small Morphological Analyser:
Examples of Coding of the FSN’s Arcs, contd
% Coding for the plural form:
arc(8000, 8001, [s|S], S, _Word, _Features).
arc(8001, 9999, S, S0, _Word,
    features(number(plural), person(third),_Ten, _Part, _)) :-
    punctuation(S, S0).