Morphology and Stemming

I256: Applied Natural Language Processing
Marti Hearst
Sept 11, 2006
Elements of Language
Today: Morphology
Illustration from http://www.departments.bucknell.edu/linguistics/lectures/05lect02.html
Jabberwocky Analysis
This is nonsense … or is it?
This is not English … but it’s much more like English
than it is like French or German or Chinese or …
Why do we pretty much understand the words?
Jabberwocky Analysis
Why do we pretty much understand the words?
We recognize combinations of morphemes.
Chortled: to laugh in a breathy, gleeful way. A
combination of "chuckle" and "snort."
(Definition from Oxford American Dictionary)
Galumphing: moving in a clumsy, ponderous, or
noisy manner. Perhaps a blend of "gallop" and
"triumph."
(Definition from Oxford American Dictionary)
Activity:
Make up a word whose meaning can be inferred from
the morphemes that you used.
Jabberwocky Analysis
Why do we pretty much understand the words?
Surrounding English words strongly indicate the
parts-of-speech of the nonsense words.
toves: probably can perform an action
(because they did gyre and gimble)
wabe: is probably a place.
(they did … in the wabe)
http://assets.cambridge.org/052185/542X/excerpt/052185542X_excerpt.pdf
Jabberwocky Analysis
Surrounding English words strongly indicate the
parts-of-speech of the nonsense words.
It’s similar in the French Translation:
Example from http://www.departments.bucknell.edu/linguistics/lectures/05lect02.html
Morphology
Morphology:
The study of the way words are built up from smaller
meaning units.
Morphemes:
The smallest meaningful units in the grammar of a language.
Contrasts:
Derivational vs. Inflectional
Regular vs. Irregular
Concatenative vs. Templatic (root-and-pattern)
A useful resource:
Glossary of linguistic terms by Eugene Loos
http://www.sil.org/linguistics/GlossaryOfLinguisticTerms/contents.htm
Modified from Dorr and Habash (after Jurafsky and Martin)
Examples (English)
“unladylike”
3 morphemes, 4 syllables
un- ‘not’
lady ‘(well behaved) female adult human’
-like ‘having the characteristics of’
Can’t break any of these down further without
distorting the meaning of the units
“technique”
1 morpheme, 2 syllables
“dogs”
2 morphemes, 1 syllable
-s, a plural marker on nouns
Morpheme Definitions
Root
The portion of the word that:
– is common to a set of derived or inflected forms, if any, when
all affixes are removed
– is not further analyzable into meaningful elements
– carries the principal portion of meaning of the words
Stem
The root or roots of a word, together with any derivational
affixes, to which inflectional affixes are added.
Affix
A bound morpheme that is joined before, after, or within a
root or stem.
Clitic
a morpheme that functions syntactically like a word, but
does not appear as an independent phonological word
– Spanish: un beso, las aguas
– English: Hal’s (genitive marker)
Inflectional vs. Derivational
Word Classes
Parts of speech: noun, verb, adjective, etc.
Word class dictates how a word combines with morphemes to
form new words
Inflection:
Variation in the form of a word, typically by means of an
affix, that expresses a grammatical contrast.
– Doesn’t change the word class
– Usually produces a predictable, nonidiosyncratic change of
meaning.
• run -> runs | running | ran
Derivation:
The formation of a new word or inflectable stem from another
word or stem.
– compute -> computer -> computerization
Inflectional Morphology
Adds:
tense, number, person, mood, aspect
Word class doesn’t change
Word serves new grammatical role
Examples
come is inflected for person and number:
The pizza guy comes at noon.
las and rojas are inflected for agreement with
manzanas in grammatical gender by -a and in
number by –s
las manzanas rojas
(‘the red apples’)
Derivational Morphology
Nominalization (formation of nouns from other parts of speech,
primarily verbs in English):
computerization
appointee
killer
fuzziness
Formation of adjectives (primarily from nouns)
computational
clueless
embraceable
Difficult cases:
building: derived from which sense of “build”?
Concatenative Morphology
Morpheme+Morpheme+Morpheme+…
Stems: also called lemma, base form, root, lexeme
hope+ing -> hoping
hop+ing -> hopping
Affixes
Prefixes: anti-, dis- in Antidisestablishmentarianism
Suffixes: -ment, -arian, -ism in Antidisestablishmentarianism
Infixes: hingi (borrow) – humingi (borrower) in Tagalog
Circumfixes: sagen (say) – gesagt (said) in German
Agglutinative Languages
uygarlaştıramadıklarımızdanmışsınızcasına
uygar+laş+tır+ama+dık+lar+ımız+dan+mış+sınız+casına
Behaving as if you are among those whom we could not cause to become civilized
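The hope+ing and hop+ing pair above shows that concatenation is not plain string joining: spelling rules apply at the morpheme boundary. A minimal sketch of two such rules follows. The function name `add_ing` and the exact conditions are assumptions made for illustration; the rules ignore stress and irregular verbs, so they over-double words like "visit" and mishandle "die".

```python
VOWELS = set("aeiou")

def add_ing(stem: str) -> str:
    """Attach -ing to an English verb stem using two simplified spelling rules."""
    # e-deletion: drop a silent final "e" (hope -> hoping), but keep
    # stems such as "see" or "be" intact.
    if len(stem) > 2 and stem.endswith("e") and not stem.endswith(("ee", "ye", "oe")):
        return stem[:-1] + "ing"
    # Consonant doubling: stem ends in consonant-vowel-consonant (hop -> hopping).
    # Final w, x, y are never doubled.
    if (len(stem) >= 3
            and stem[-1] not in VOWELS and stem[-1] not in "wxy"
            and stem[-2] in VOWELS
            and stem[-3] not in VOWELS):
        return stem + stem[-1] + "ing"
    # Default: plain concatenation.
    return stem + "ing"

for verb in ["hope", "hop", "walk", "see"]:
    print(verb, "->", add_ing(verb))
```

A real analyzer has to run rules like these in reverse as well, which is why "hoping" and "hopping" must be mapped back to different stems.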
Templatic Morphology
Roots and Patterns
Example: Hebrew verbs
Root:
– Consists of 3 consonants CCC
– Carries basic meaning
Template:
– Gives the ordering of consonants and vowels
– Specifies semantic information about the verb
• Active, passive, middle voice
Example:
– lmd (to learn or study)
• CaCaC -> lamad (he studied)
• CiCeC -> limed (he taught)
• CuCaC -> lumad (he was taught)
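The root-and-pattern idea above can be sketched as slot filling: each C in the template takes the next root consonant, and the vowels come from the template. The function name `apply_template` is an assumption for illustration, and the sketch works on the transliterations used above rather than on Hebrew script.

```python
def apply_template(root: str, template: str) -> str:
    """Interleave root consonants into the C slots of a template."""
    if template.count("C") != len(root):
        raise ValueError("template must have one C slot per root consonant")
    consonants = iter(root)
    # Each "C" consumes the next root consonant; vowels are copied as-is.
    return "".join(next(consonants) if slot == "C" else slot
                   for slot in template)

for template in ["CaCaC", "CiCeC", "CuCaC"]:
    print(template, "->", apply_template("lmd", template))
```

Unlike suffix stripping, the root here is discontinuous in the surface word, so an analyzer cannot recover it by cutting characters off either end.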
Morphological Analysis Tools
Porter stemmer:
A simple approach: just hack off the end of the
word!
Frequently used, especially for Information Retrieval,
but results are pretty ugly!
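To make "hack off the end of the word" concrete, here is a toy suffix stripper. It is not the Porter algorithm, which applies several ordered rule steps with conditions on the remaining stem; the suffix list, the function name `crude_stem`, and the `min_stem` threshold are all assumptions chosen for illustration.

```python
# Hypothetical suffix list; the real Porter stemmer uses ordered rule steps.
SUFFIXES = ["ization", "ational", "ing", "ed", "es", "s"]

def crude_stem(word: str, min_stem: int = 3) -> str:
    """Strip the longest matching suffix, leaving at least min_stem characters."""
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[:-len(suffix)]
    return word

for word in ["years", "publishing", "exposed", "this", "is"]:
    print(word, "->", crude_stem(word))
```

Because no dictionary is consulted, non-words such as "thi" and "expos" come out alongside reasonable stems such as "year" and "publish".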
porter.demo()
Original *****************************
Pierre Vinken , 61 years old , will join the board as a nonexecutive
director Nov. 29 . Mr. Vinken is chairman of Elsevier N.V. , the Dutch
publishing group . Rudolph Agnew , 55 years old and former chairman of
Consolidated Gold Fields PLC , was named a nonexecutive director of
this British industrial conglomerate . A form of asbestos once used to
make Kent cigarette filters has caused a high percentage of cancer
deaths among a group of workers exposed to it more than 30 years ago ,
researchers reported .
Results *******************************
Pierr Vinken , 61 year old , will join the board as a nonexecut
director Nov. 29 . Mr. Vinken is chairman of Elsevi N.V. , the Dutch
publish group . Rudolph Agnew , 55 year old and former chairman of
Consolid Gold Field PLC , wa name a nonexecut director of thi British
industri conglomer . A form of asbesto onc use to make Kent cigarett
filter ha caus a high percentag of cancer death among a group of
worker expos to it more than 30 year ago , research report .
Morphological Analysis Tools
WordNet’s morphy()
A slightly more sophisticated approach
Uses an understanding of inflectional morphology
– Uses a set of Rules of Detachment
– Uses an Exception List for irregulars
– Handles collocations in a special way
Applies the transformation, then compares the result to the
WordNet dictionary
If the transformation produces a real word, keeps it;
otherwise keeps the original word.
For more details, see
– http://wordnet.princeton.edu/man/morphy.7WN.html
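The procedure described above (exception list, then detachment rules checked against a dictionary, then fall back to the original word) can be sketched as follows. This is not WordNet's implementation: the tiny lexicon, exception list, and rule table are hypothetical stand-ins, the name `simple_morphy` is made up, and part of speech and collocations are ignored.

```python
# Hypothetical miniature stand-ins for WordNet's data files (assumptions).
LEXICON = {"dog", "run", "corpus", "church", "fly"}
EXCEPTIONS = {"corpora": "corpus", "ran": "run"}
DETACHMENT_RULES = [
    ("ies", "y"),
    ("ches", "ch"),
    ("es", "e"),
    ("s", ""),
    ("ing", ""),
    ("ing", "e"),
    ("ed", ""),
    ("ed", "e"),
]

def simple_morphy(word: str) -> str:
    """Return a base form if one can be verified, else the original word."""
    # 1. Irregular forms are looked up directly.
    if word in EXCEPTIONS:
        return EXCEPTIONS[word]
    # 2. Words already in the dictionary need no change.
    if word in LEXICON:
        return word
    # 3. Try each detachment rule; keep a candidate only if it is a real word.
    for suffix, replacement in DETACHMENT_RULES:
        if word.endswith(suffix):
            candidate = word[:-len(suffix)] + replacement
            if candidate in LEXICON:
                return candidate
    # 4. No verified base form: keep the original word.
    return word

for word in ["dogs", "corpora", "flies", "churches", "blorfs"]:
    print(word, "->", simple_morphy(word))
```

The dictionary check is what separates this from plain suffix stripping: "blorfs" is left alone because "blorf" is not a known word.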
Some morphy() output
>>> wntools.morphy('dogs')
'dog'
>>> wntools.morphy('running', pos='verb')
'run'
>>> wntools.morphy('corpora')
'corpus'
>>>
Morphological Analysis Tools
Very sophisticated programs have been developed
Use a technique called Two-Level Phonology
Has been applied to numerous languages
Best known: PC-KIMMO
Named after Kimmo Koskenniemi, based in part on work by Lauri
Karttunen in 1983
Uses:
– A rules file which specifies the alphabet and the phonological
(or spelling) rules,
– A lexicon file which lists lexical items and encodes
morphotactic constraints.
http://www.sil.org/pckimmo/
NLTK-Lite has a version too (not working in my download)
Commercial versions are available
Inxight’s LinguistX is based on technology developed
by Kaplan and others at Xerox PARC (or at least it used to be)
Morphological Analysis Tools
CatVar:
Categorial Variation Database
“A database of clusters of uninflected words
(lexemes) and their categorial (i.e. part-of-speech)
variants.”
Example: the “developing” cluster: (develop(V),
developer(N), developed(AJ), developing(N),
developing(AJ), development(N)).
http://clipdemos.umiacs.umd.edu/catvar
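One way to picture a cluster like the one above is as a list of (word, part-of-speech) pairs that can be grouped by category. This is only an in-memory sketch of the example; it does not reflect CatVar's actual file format or any API, and the name `variants_by_pos` is an assumption.

```python
from collections import defaultdict

# The "developing" cluster from the example, as (word, part-of-speech) pairs.
CLUSTER = [
    ("develop", "V"),
    ("developer", "N"),
    ("developed", "AJ"),
    ("developing", "N"),
    ("developing", "AJ"),
    ("development", "N"),
]

def variants_by_pos(cluster):
    """Group the members of one cluster by their part-of-speech tag."""
    by_pos = defaultdict(list)
    for word, pos in cluster:
        by_pos[pos].append(word)
    return dict(by_pos)

print(variants_by_pos(CLUSTER))
```

Grouping this way makes it easy to ask, for instance, for the noun variants of a verb found in text.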
Next Time
Computing with n-grams