M>0 - University of Memphis

Download Report

Transcript M>0 - University of Memphis

Natural Language Processing
Vasile Rus
http://www.cs.memphis.edu/~vrus/teaching/nlp
Outline
• Announcements
• Word-level Processing
• Stemming
Announcements
• Paper presentations (PhD students)
• Project
Language
• Language = words grouped according to
some rules called a grammar
Language = words + rules
Rules are too flexible for system developers
Rules are not flexible enough for poets
4
Language
• Dictionary/Lexicon
– set of words defined in the language
– open (dynamic)
– entries in lexicon are called lemma
• Grammar
– set of rules which describe what is allowable in a language
– Classic Grammars
• meant for humans who know the language
• definitions and rules are mainly supported by examples
• no (or almost no) formal description tools; cannot be programmed
– Explicit Grammar (CFG, Dependency Grammars,...)
• formal description
• can be programmed & tested on data (texts)
Levels of (Formal) Description
• Speech/Character Recognition
– Speech: Phonetics and Phonology
handle words
• Morphology
•
•
•
•
Syntax
Semantics
Pragmatics
Discourse
handle
sounds/
graphics
handle rules for grouping
words in legal
language constructs
6
NLP Pipeline
speech
text
Phonetic Analysis
Character Recognition
Morphological analysis
Syntactic analysis
Semantic Interpretation
Discourse Processing
This
Lecture
Words and their Internal Affairs:
Morphology
• Words are grouped into classes/ grammatical
categories/ syntactic categories/parts-of-speech (POS)
based
– on their syntactic and morphological behavior
• Noun: words that occur with determiners, take possessives, occur
(most but not all) in plural form
– and less on their typical semantic type
• Luckily the classes are semantically coherent at some extent
• A word belongs to a category if it passes the
substitution test
– The sad/intelligent/green/fat bug submerged in my soup.
They all belong to the same class: ADJ
Words and their Internal Affairs:
Morphology
• Word categories are of two types:
– Open categories: accept new members
•
•
•
•
Nouns
Verbs
Adjectives
Adverbs
Any known human language
has nouns and verbs!
– Closed or functional categories
• Almost fixed membership
• Few members
• Determiners, prepositions, pronouns, conjunctions, auxiliary
verbs?, particles, numerals, etc.
• Play an important role in grammar
Phonology and Morphology
• Phonology: how sounds are realized in different contexts
• Phoneme: a set of closely related speech sounds (phones)
regarded as a single sound. For example, the sound of "r" in
red, bring, or round is a phoneme.
• Historical spelling
– night, nite
– attention, mission, fish
• Script Limitations
– Spoken English has 14 vowels
• heed hid hayed head had hoed hood who’d hide how’d taught Tut toy
enough
– English Alphabet has 5
• Use vowel combinations: far fair fare
• Consonantal doubling (hopping vs. hoping)
Syntax and Morphology
• Phrase-level agreement
– Subject-Verb
• John studies hard (STUDY+3SG)
– Noun-Adjective
• Las vacas hermosas
Morphology: Morphemes
• Studies how words are built up from morpheme, the
minimal meaning bearing unit
– Example: foxes - fox + es
– could as well be some “ID” numbers:
• e.g. fox ~ 2327, es ~ 1278
• Two broad classes of morphemes:
– Stem: the main morpheme of a word (supplies the main
meaning)
– Affixes: provide additional meanings
•
•
•
•
prefixes: precede the stem
suffixes: follow the stem
infixes: inside the stem
circumfixes: precede and follow the stem
Morphology
• Morpheme combine according to some fixed rules
– Concatenative morphology
• Prefixes and suffixes
• Morphemes combined in complex ways
– Non-concatenative/templatic morphology
• Morphology can be divided up into two broad classes
– Inflectional:
– Derivational
Inflectional Morphology
• Inflectional morphology concerns the
combination of stems and affixes where the
resulting word
– Has the same word class as the original
– Serves a grammatical/semantic purpose different
from the original
Nouns (English)
• Nouns are simple (not really)
– Markers for plural and possessive
• Plural:
– Regular plural: affix –s or -es
• Cat – cats
• Thrush – thrushes
– Irregular plural:
• Mouse – mice
• Ox - oxen
Verbs
• Only main and primary (be, have, do) verbs
have inflectional affixes (modals don’t have)
– Regular
• Walk, walks, walking, walked, walked
– Irregular
• Eat, eats, eating, ate, eaten
• Catch, catches, catching, caught, caught
• Cut, cuts, cutting, cut, cut
Regular Verbs
Morphological
Form Classes
Regularly Inflected Verbs
Stem
Walk
Merge
Try
Map
-s form
Walks
Merges
Tries
Maps
-ing participle
Walking Merging
Trying
Mapping
Past form or –ed
participle
Walked Merged
Tried
mapped
Irregular Verbs
Morphological
Form Classes
Irregularly Inflected Verbs
Stem
Eat
Catch
Cut
-s form
Eats
Catches
Cuts
-ing participle
Eating
Catching
Cutting
Past form
Ate
Caught
Cut
–ed participle
Eaten
Caught
cut
Derivational Morphology
• Combination of a word stem with a grammatical
morpheme
• Results in a word of a different class
• Derivational morphology is complex in English
– Quasi-systematicity
– Irregular meaning change
– Changes of word class
• Nominalisation
– Computerize + ation - computerization
Derivational Examples
• Verb/Adj to Noun
-ation
computerize
computerization
-ee
-er
-ness
appoint
kill
fuzzy
appointee
killer
fuzziness
Derivational Examples
• Noun/Verb to Adj
-al
Computation
Computational
-able
Embrace
Embraceable
-less
Clue
Clueless
Compute
• Many paths are possible…
• Start with compute
–
–
–
–
Computer -> computerize -> computerization
Computation -> computational
Computer -> computerize -> computerizable
Compute -> computee
Computational Morphology
• Morphological Parsing
– Maps surface representation to lexicon entries
– Finite State Morphology
• Finite State Transducers (FST)
• Input/Output
• Analysis/Generation
Computational Morphology
WORD
• Cats
• cat
• cities
• geese
• ducks
STEM (+FEATURES)*
cat +N +PL
cat +N +SG
city +N +PL
goose +N +PL
(duck +N +PL) or
(duck +V +3SG)
• merging merge +V +PRES-PART
• caught (catch +V +PAST-PART) or
(catch +V +PAST)
Computational Morphology
• The Rules and the Lexicon
–
–
–
–
General versus Specific
Regular versus Irregular
Accuracy, speed, space
The Morphology of a language
• Approaches
– Lexicon only
– Rules only
– Lexicon and Rules
• Finite-state Transducers: a Finite-state Automata that
maps between lexical and surface level
Lexicon-only Morphology
• The lexicon lists all surface level and lexical level pairs
• No rules …?
• Analysis/Generation is easy
• Very large for English
• What about Arabic or Turkish?
• Chinese?
acclaim
acclaim
acclaimed
acclaimed
acclaiming
acclaims
acclaims
acclamation
acclamations
acclimate
acclimated
acclimated
acclimates
acclimating
acclaim $N$
acclaim $V+0$
acclaim $V+ed$
acclaim $V+en$
acclaim $V+ing$
acclaim $N+s$
acclaim $V+s$
acclamation
acclamation
acclimate
acclimate
acclimate
acclimate
acclimate
$N$
$N+s$
$V+0$
$V+ed$
$V+en$
$V+s$
$V+ing$
Lexicon and Rules
FSA Inflectional Morphology
• English Noun Lexicon
reg-noun
Irreg-pl-noun
Irreg-sg-noun
plural
fox
cat
dog
geese
sheep
mice
goose
sheep
mouse
-s
• English Noun
Rule
Rules Only: Stemming
• Sometimes you just need to know the stem of a word
and you don’t care about the structure
• In fact you may not even care if you get the right
stem, as long as you get a consistent string
• This is stemming… it most often shows up in IR
(Information Retrieval) applications
• IR: the task of locating most relevant documents in a
collection given a query as a set of keywords
Stemming in IR
• Run a stemmer on the documents to be
indexed
• Run a stemmer on users queries
• Match
Porter Stemmer
•
•
•
•
No lexicon needed
Basically a set of staged sets of rewrite rules that strip suffixes
Handles both inflectional and derivational suffixes
Doesn’t guarantee that the resulting stem is really a stem
– Lack of guarantee doesn’t matter for IR
– In a test on the well-known Cranfield 200 collection it gave an
improvement in retrieval performance when compared with a very
much more elaborate program which has been in use in IR research in
Cambridge since 1971
• DAWSON, J.L. Suffix Removal and Word Conflation. ALLC Bulletin,
Michaelmas 1974 p.33-46
• ANDREWS, K. The Development of a Fast Conflation Algorithm for
English. Dissertation for the Diploma in Computer Science, Computer
Laboratory, University of Cambridge, 1971
Porter Stemmer: Examples
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
wear
wearable
wearer
wearied
wearier
weariest
wearily
weariness
wearing
wearisome
wearisomely
wears
weather
weathercock
weathercocks
wear
wearabl
wearer
weari
wearier
weariest
wearili
weari
wear
wearisom
wearisom
wear
weather
weathercock
weathercock
•
•
•
•
•
•
•
•
•
•
•
•
•
web
Webb
Webber
webs
Webster
Websterville
wedded
wedding
weddings
wedge
wedged
wedges
wedging
web
webb
webber
web
webster
webstervil
wedd
wedd
wedd
wedg
wedg
wedg
wedg
Definitions
• C = consonant = Not A E I O U or (Y preceded by consonant)
• V = not C
• Every word (or part of a word) has one of those forms
–
–
–
–
CVCV … C
CVCV … V
VCVC … C
VCVC … V
• C* means zero or more consonants; similarly V*
• C+ means one or more consonants; similarly V+
• M = Measure of sequences of V+C+: Words = C*(V+C+){M}V*
M=0 TR, EE, TREE, Y, BY
M=1 TROUBLE, OATS, TREES, IVY
M=2 TROUBLES, PRIVATE, OATEN, ORRERY
Definitions
• Conditions
*S - stem ends with S - (and similarly for the other
letters)
*v* - stem contains a V
*d - stem ends with double C
e.g. -DD, -ZZ
*o - stem ends CVC, where the second C is not W,
X or Y
e.g. -WIL, -SOB
Stemming Rules
• The rules for removing a suffix will be given in the form
(condition) S1 -> S2
• This means: if a word ends with the suffix S1, and the
stem before S1 satisfies the given condition, S1 is
replaced by S2
• The condition is usually given in terms of m
(m > 1) EMENT ->
Here S1 is `EMENT' and S2 is null. This would map REPLACEMENT to
REPLAC, since REPLAC is a word part for which m = 2.
• the condition part may also contain expressions with
and, or and not
– (m>1 and (*S or *T)) tests for a stem with m>1 ending in S or T
– (*d and not (*L or *S or *Z)) tests for a stem ending with a double
consonant other than L, S or Z
Stemming Rules
• In a set of rules written beneath each other, only
one is obeyed, and this will be the one with the
longest matching S1 for the given word
• For example, with
SSES -> SS
IES -> I
SS -> SS
S ->
CARESSES maps to CARESS since SSES is the
longest match for S1. Equally CARESS maps to
CARESS (S1=`SS') and CARES to CARE (S1=`S')
*<S> = ends with <S>
Porter Stemmer
*v* = contains a V
*d = ends with double C
*o = ends with CVC
second C is not W, X or Y
Step 1: Plural Nouns and Third Person Singular Verbs
SSES  SS
IES  I
caresses  caress
ponies  poni
ties  ti
caress  caress
cats  cat
SS  SS
S
Step 2a: Verbal Past Tense and Progressive Forms
(M>0) EED  EE

i
(*v*) ED
ii
(*v*) ING 
feed  feed, agreed  agree
plastered 
motoring
plaster, bled  bled
 motor, sing  sing
Step 2b: If 2a.i or 2a.ii is successful, Cleanup
AT  ATE
conflat(ed)  conflate
BL  BLE
troubl(ed)  trouble
IZ  IZE
siz(ed)  size
(*d and not (*L or *S or *Z))
 single letter
(M=1 and *o)  E
hopp(ing) 
hop, tann(ed)  tan
hiss(ing)  hiss, fizz(ed)  fizz
fail(ing)  fail, fil(ing)  file
*<S> = ends with <S>
Porter Stemmer
*v* = contains a V
*d = ends with double C
*o = ends with CVC
second C is not W, X or Y
Step 3: Y  I
(*v*) Y  I
happy  happi
sky  sky
*<S> = ends with <S>
Porter Stemmer
*v* = contains a V
*d = ends with double C
*o = ends with CVC
second C is not W, X or Y
Step 4: Derivational Morphology I
(m>0) ATIONAL -> ATE
(m>0) TIONAL -> TION
(m>0) ENCI -> ENCE
(m>0) ANCI -> ANCE
(m>0) IZER -> IZE
(m>0) ABLI -> ABLE
(m>0) ALLI -> AL
(m>0) ENTLI -> ENT
(m>0) ELI -> E
(m>0) OUSLI -> OUS
(m>0) IZATION -> IZE
(m>0) ATION -> ATE
(m>0) ATOR -> ATE
(m>0) ALISM -> AL
(m>0) IVENESS -> IVE
(m>0) FULNESS -> FUL
(m>0) OUSNESS -> OUS
(m>0) ALITI -> AL
(m>0) IVITI -> IVE
(m>0) BILITI -> BLE
relational -> relate
conditional -> condition
rational
-> rational
valenci
-> valence
hesitanci
-> hesitance
digitizer
-> digitize
conformabli -> conformable
radicalli
-> radical
differentli -> different
vileli
- > vile
analogousli -> analogous
vietnamization -> vietnamize
predication -> predicate
operator
-> operate
feudalism -> feudal
decisiveness -> decisive
hopefulness -> hopeful
callousness -> callous
formaliti
-> formal
sensitiviti -> sensitive
sensibiliti -> sensible
*<S> = ends with <S>
Porter Stemmer
*v* = contains a V
*d = ends with double C
*o = ends with CVC
second C is not W, X or Y
Step 5: Derivational Morphology II: More Suffixes
(m>0) ICATE -> IC
(m>0) ATIVE ->
(m>0) ALIZE -> AL
(m>0) ICITI -> IC
(m>0) ICAL -> IC
(m>0) FUL ->
(m>0) NESS ->
triplicate -> triplic
formative -> form
formalize -> formal
electriciti -> electric
electrical -> electric
hopeful
-> hope
goodness
-> good
*<S> = ends with <S>
Porter Stemmer
*v* = contains a V
*d = ends with double C
*o = ends with CVC
second C is not W, X or Y
Step 5: Derivational Morphology III: Even More Suffixes
(m>1) AL ->
(m>1) ANCE ->
(m>1) ENCE ->
(m>1) ER ->
(m>1) IC ->
(m>1) ABLE ->
(m>1) IBLE ->
(m>1) ANT ->
(m>1) EMENT ->
(m>1) MENT ->
(m>1) ENT ->
(m>1 and (*S or *T)) ION ->
(m>1) OU ->
(m>1) ISM ->
(m>1) ATE ->
(m>1) ITI ->
(m>1) OUS ->
(m>1) IVE ->
(m>1) IZE ->
revival
-> reviv
allowance -> allow
inference -> infer
airliner
-> airlin
gyroscopic -> gyroscop
adjustable -> adjust
defensible -> defens
irritant
-> irrit
replacement -> replac
adjustment -> adjust
dependent -> depend
adoption
-> adopt
homologou -> homolog
communism -> commun
activate
-> activ
angulariti -> angular
homologous -> homolog
effective -> effect
bowdlerize -> bowdler
*<S> = ends with <S>
Porter Stemmer
*v* = contains a V
*d = ends with double C
*o = ends with CVC
second C is not W, X or Y
Step 7a: Cleanup
(m>1) E 
(m=1 and not *o) E 
probate  probat
rate  rate
cease  ceas
Step 7b: More Cleanup
(m > 1 and *d and *L)
 single letter
controll  control
roll  roll
Some Insights
• The algorithm is careful not to remove a suffix when the stem is too short,
the length of the stem being given by its measure, m
• For example, in the following two lists:
list A
-----RELATE
PROBATE
CONFLATE
PIRATE
PRELATE
list B
-----DERIVATE
ACTIVATE
DEMONSTRATE
NECESSITATE
RENOVATE
-ATE is removed from the list B words, but not from the list A words.
• The fact that no attempt is made to identify prefixes can make the results
look rather inconsistent
– PRELATE does not lose the -ATE, but ARCHPRELATE becomes
ARCHPREL
– in practice this does not matter too much, because the presence of the prefix
decreases the probability of an erroneous conflation
More Insights
•
Complex suffixes are removed bit by bit in the different steps
– GENERALIZATIONS -> GENERALIZATION (Step 1) -> GENERALIZE (Step 2) > GENERAL (Step 3) -> GENER (Step 4)
– OSCILLATORS -> OSCILLATOR (Step 1) -> OSCILLATE (Step 2) -> OSCILL
(Step 4) -> OSCIL (Step 5)
•
In a vocabulary of 10,000 words, the reduction in size of the stem was
distributed among the steps as follows:
Suffix stripping of a vocabulary of 10,000 words
-----------------------------------------------Number of words reduced in step 1: 3597
" 2: 766
" 3: 327
" 4: 2424
" 5: 1373
Number of words not reduced: 3650
The resulting vocabulary of stems contained 6370 distinct entries (reduced
the size of the vocabulary by about one third).
Problems with Stemming
Errors by the Porter Stemmer
Organization
Organ
Doing
Doe
Generalization
Generic
Numerical
Numerous
Policy
Police
University
Universe
Easy
Easily
Addition
Additive
Negligible
Negligent
Execute
Executive
Define
Definite
Past
paste
Morphology: From Morphemes to
Lemmas
• Lemma : lexical unit, “pointer” to lexicon
– Set of lexical forms having same stem, same major
part-of-speech, and same word sense
– might as well be a number, but typically is
represented as the “base form”, or “dictionary
headword”
• possibly indexed when ambiguous/polysemous: state1
(verb), state2 (state- of -affairs), state3 (government)
Phones vs Phonemes
A phone is …
A phoneme is …
One of many possible sounds in the
languages of the world.
A contrastive unit in the sound system of a
particular language.
The smallest identifiable unit found in a
stream of speech.
A minimal unit that serves to distinguish
between meanings of words.
Pronounced in a defined way.
Pronounced in one or more ways,
depending on the number of allophones.
Represented between brackets by
convention.
Example: [b], [j], [o]
Represented between slashes by
convention.
Example: /b/, /j/, /o/
Lexeme, Morpheme, Phoneme
Syntax
Lexeme/Inflected
Lexeme
Grammars
sentences
Morphology
Morpheme/Allomorph
Morphotactics
words
Phonology
Phoneme/Allophone
Phonotactics
letters
Lexeme: individual entry in lexicon
Allomorph is a variant form of a morpheme. The meaning remains the
same, while the sound can vary. (-ed in fished is [t], in buzzed is [d])
Allophone is one of several similar phones or speech sounds, that
belong to the same phoneme. Each allophone is used in a specific
phonetic context.
Summary
• Morphology
• Stemming
– Porter’s Stemmer; see the link to the code on
the class web page
Next Time
• Part of Speech Tagging