Transcript Lecture 4
CS60057
Speech & Natural Language Processing
Autumn 2005
Lecture 4
28 July 2005
Lecture 3, 7/27/2005
Natural Language Processing
1
Morphology
Study of the rules that govern the combination of
morphemes.
Inflection: same word, different syntactic information
Run/runs/running, book/books
Derivation: new word, different meaning
Often different part of speech, but not always
Possible/possibly/impossible, happy/happiness
Compounding: new word, each part is a word
Blackboard, bookshelf
lAlakamala, banabAsa
Morphology Level: The Mapping
Formally: A+ → 2^(L,C1,C2,...,Cn)
i.e., each phoneme string maps to a set of analyses, each a lemma plus values of the morphological categories (forms can be ambiguous).
A is the alphabet of phonemes (A+ denotes any non-empty
sequence of phonemes)
L is the set of possible lemmas, uniquely identified
Ci are morphological categories, such as:
grammatical number, gender, case
person, tense, negation, degree of comparison, voice, aspect, ...
tone, politeness, ...
part of speech (not quite morphological category, but...)
A, L and Ci are obviously language-dependent
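As a sketch of this mapping, a toy Python table can send each surface string to its set of (lemma, categories) analyses; the forms and tags here are illustrative only, not a real analyzer:

```python
# A sketch of the morphological mapping A+ -> 2^(L, C1, ..., Cn):
# each surface string maps to a SET of (lemma, categories) analyses,
# since forms are often ambiguous. Toy data for illustration.
ANALYSES = {
    "flies": {
        ("fly", ("N", "PL")),           # the noun: fly -> flies
        ("fly", ("V", "3SG", "PRES")),  # the verb: she flies
    },
    "books": {
        ("book", ("N", "PL")),
        ("book", ("V", "3SG", "PRES")),
    },
}

def analyze(surface):
    """Return the set of (lemma, categories) pairs for a surface form."""
    return ANALYSES.get(surface, set())

print(analyze("flies"))
```

Note that the result is a set: "flies" has both a noun and a verb reading, which is exactly why the mapping goes to the power set.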
Bengali/Hindi Inflectional Morphology
Certain languages encode more grammatical information in morphology than in syntax.
Some of the inflectional suffixes that nouns can take:
singular/plural
gender
possessive markers
case markers (the different karakas)
Inflectional suffixes that verbs can have:
Hindi: Tense, aspect, modality, person, gender, number
Bengali: Tense, aspect, modality, person
Order among inflectional suffixes (morphotactics):
Chhelederke
baigulokei
Bengali/Hindi Derivational Morphology
Derivational morphology is very rich.
English Inflectional Morphology
Nouns have simple inflectional morphology.
plural -- cat / cats
possessive -- John / John’s
Verbs have slightly more complex, but still relatively simple, inflectional morphology.
past form -- walk / walked
past participle form -- walk / walked
gerund -- walk / walking
singular third person -- walk / walks
Verbs can be categorized as:
main verbs
modal verbs -- can, will, should
primary verbs -- be, have, do
Regular and irregular verbs: walk / walked -- go / went
Regulars and Irregulars
Some words misbehave (refuse to follow the rules)
Mouse/mice, goose/geese, ox/oxen
Go/went, fly/flew
The terms regular and irregular will be used to refer to
words that follow the rules and those that don’t.
Regular and Irregular Verbs
Regulars…
Walk, walks, walking, walked, walked
Irregulars
Eat, eats, eating, ate, eaten
Catch, catches, catching, caught, caught
Cut, cuts, cutting, cut, cut
Derivational Morphology
Quasi-systematicity
Irregular meaning change
Changes of word class
renationalizationability
Some English derivational affixes
-ation : transport / transportation
-er : kill / killer
-ness : fuzzy / fuzziness
-al : computation / computational
-able : break / breakable
-less : help / helpless
un- : do / undo
re- : try / retry
Derivational Examples
Verb/Adj to Noun
-ation : computerize / computerization
-ee : appoint / appointee
-er : kill / killer
-ness : fuzzy / fuzziness
Derivational Examples
Noun/Verb to Adj
-al : computation / computational
-able : embrace / embraceable
-less : clue / clueless
Compute
Many paths are possible…
Start with compute
Computer -> computerize -> computerization
Computation -> computational
Computer -> computerize -> computerizable
Compute -> computee
Parts of A Morphological Processor
For a morphological processor, we need at least the following:
Lexicon : The list of stems and affixes together with basic
information about them such as their main categories (noun,
verb, adjective, …) and their sub-categories (regular noun,
irregular noun, …).
Morphotactics : The model of morpheme ordering that explains
which classes of morphemes can follow other classes of
morphemes inside a word.
Orthographic Rules (Spelling Rules) : These spelling rules are
used to model changes that occur in a word (normally when two
morphemes combine).
Lexicon
A lexicon is a repository for words (stems).
They are grouped according to their main categories.
noun, verb, adjective, adverb, …
They may also be divided into sub-categories.
regular-nouns, irregular-singular nouns, irregular-plural nouns, …
The simplest way to create a morphological parser is to put all possible words (together with their inflections) into the lexicon.
We do not do this because their number is huge (theoretically, for Turkish, it is infinite).
Morphotactics
Which morphemes can follow which morphemes.
Lexicon:
regular-noun : fox, cat, dog
irreg-pl-noun : geese, sheep, mice
irreg-sg-noun : goose, sheep, mouse
plural : -s
Simple English Nominal Inflection (Morphotactic Rules)
state 0 --reg-noun--> state 1 --plural (-s)--> state 2
state 0 --irreg-sg-noun--> state 2
state 0 --irreg-pl-noun--> state 2
(0 is the initial state; 1 and 2 are final states)
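The morphotactic rules above can be sketched as a small recognizer. The stem classes and word lists follow the slide; the function itself is a simplified stand-in for the FSA, not a full automaton implementation:

```python
# Sketch of the nominal-inflection FSA: a stem alone is accepted,
# a regular stem may take plural -s, irregular forms are listed.
# Word classes taken from the lexicon on the slide.
REG_NOUN = {"fox", "cat", "dog"}
IRREG_SG = {"goose", "sheep", "mouse"}
IRREG_PL = {"geese", "sheep", "mice"}

def accepts(word):
    """True if word is a stem, a regular stem + s, or an irregular form."""
    if word in IRREG_SG or word in IRREG_PL:
        return True
    if word in REG_NOUN:
        return True
    if word.endswith("s") and word[:-1] in REG_NOUN:
        return True
    return False
```

Like the FSA on the next slide, this recognizer over-accepts: accepts("foxs") is True, because the spelling rule that inserts e has not been applied yet.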
Combine Lexicon and Morphotactics
[FSA diagram: the morphotactic FSA with each stem spelled out letter by letter -- paths for fox, cat, dog, goose/geese, sheep, mouse/mice, followed by the plural -s arc]
This FSA only says yes or no; it does not give a lexical representation.
It also accepts a wrong word (foxs).
FSAs and the Lexicon
This will actually require a new kind of FSA: the Finite State Transducer (FST).
We will give a quick overview
First we’ll capture the morphotactics
The rules governing the ordering of affixes in a language.
Then we’ll add in the actual words
Simple Rules
Adding the Words
Derivational Rules
Parsing/Generation
vs. Recognition
Recognition is usually not quite what we need.
Usually if we find some string in the language we need to find the
structure in it (parsing)
Or we have some structure and we want to produce a surface form
(production/generation)
Example
From “cats” to “cat +N +PL”
Why care about morphology?
‘Stemming’ in information retrieval
Might want to search for “aardvark” and find pages with both
“aardvark” and “aardvarks”
Morphology in machine translation
Need to know that the Spanish words quiero and quieres are
both related to querer ‘want’
Morphology in spell checking
Need to know that misclam and antiundoggingly are not words
despite being made up of word parts
Can’t just list all words
Turkish word Uygarlastiramadiklarimizdanmissinizcasina
`(behaving) as if you are among those whom we could not civilize’
Uygar   ‘civilized’ +
las     ‘become’ +
tir     ‘cause’ +
ama     ‘not able’ +
dik     ‘past’ +
lar     ‘plural’ +
imiz    ‘p1pl’ +
dan     ‘abl’ +
mis     ‘past’ +
siniz   ‘2pl’ +
casina  ‘as if’
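Joining the morphemes listed on the slide reconstructs the full word, which a few lines of Python can verify (ASCII spellings as on the slide):

```python
# The morpheme-by-morpheme breakdown from the slide: concatenating
# the pieces reconstructs the full Turkish word.
MORPHEMES = ["Uygar", "las", "tir", "ama", "dik", "lar",
             "imiz", "dan", "mis", "siniz", "casina"]
GLOSSES = ["civilized", "become", "cause", "not able", "past",
           "plural", "p1pl", "abl", "past", "2pl", "as if"]

word = "".join(MORPHEMES)
for m, g in zip(MORPHEMES, GLOSSES):
    print(f"{m:8s} {g}")
print(word)
```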
Finite State Transducers
The simple story
Add another tape
Add extra symbols to the transitions
On one tape we read “cats”, on the other we write “cat +N
+PL”
Transitions
c:c
a:a
t:t
+N:ε
+PL:s
c:c means read a c on one tape and write a c on the other
+N:ε means read a +N symbol on one tape and write nothing on the other
+PL:s means read +PL and write an s
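The pair-labeled transitions above can be sketched as follows; this hard-codes the single path for "cat" (c:c a:a t:t +N:ε +PL:s) rather than building a full transducer, with "" playing the role of ε:

```python
# Sketch of the two-tape idea: each transition is an (input, output)
# pair; "" stands for epsilon. This is only the path for "cat".
TRANSITIONS = [("c", "c"), ("a", "a"), ("t", "t"), ("+N", ""), ("+PL", "s")]

def transduce(symbols, transitions):
    """Map a lexical symbol sequence to a surface string, or None."""
    if symbols != [t[0] for t in transitions]:
        return None  # input does not match this path
    return "".join(t[1] for t in transitions)

print(transduce(["c", "a", "t", "+N", "+PL"], TRANSITIONS))  # cats
```

Reading "cat +N +PL" on one tape thus writes "cats" on the other, exactly the +N:ε and +PL:s behavior described above.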
Lexical to Intermediate Level
FST Properties
FSTs are closed under: union, inversion, and composition.
union : The union of two regular relations is also a regular relation.
inversion : The inversion of an FST simply switches the input and output labels.
This means that the same FST can be used for both directions of a morphological processor.
composition : If T1 is an FST from I1 to O1 and T2 is an FST from O1 to O2, then the composition of T1 and T2 (T1 ∘ T2) maps from I1 to O2.
We use these properties of FSTs in the creation of the FST for a
morphological processor.
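Inversion can be sketched by swapping the input and output labels on every arc; the tuple representation of arcs used here is an assumption for illustration:

```python
# Sketch of FST inversion: swapping input/output labels on every arc
# turns a generator (lexical -> surface) into a parser (surface ->
# lexical), so one machine serves both directions.
def invert(arcs):
    """arcs: list of (state, in_sym, out_sym, next_state) tuples."""
    return [(q, o, i, r) for (q, i, o, r) in arcs]

generator = [(0, "+PL", "s", 1), (0, "+SG", "", 1)]
parser = invert(generator)
print(parser)  # [(0, 's', '+PL', 1), (0, '', '+SG', 1)]
```

Inverting twice gives back the original machine, which is why generation and parsing can share one FST.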
An FST for Simple English Nominals
[FST diagram: each stem class is followed by +N:є, then the number feature]
reg-noun --> +N:є --> +SG:# or +PL:^s#
irreg-sg-noun --> +N:є --> +SG:#
irreg-pl-noun --> +N:є --> +PL:#
FST for stems
An FST for stems maps roots to their root class:
reg-noun : fox, cat, dog (fox stands for f:f o:o x:x)
irreg-pl-noun : g o:e o:e s e (goose → geese), sheep, m o:i u:є s:c e (mouse → mice)
irreg-sg-noun : goose, sheep, mouse
When these two transducers are composed, we have an FST which maps lexical forms to intermediate forms of words for simple English noun inflection.
The next thing to handle is designing the FSTs for the orthographic rules, and combining all these transducers.
Multi-Level Multi-Tape Machines
A frequently used FST idiom, called a cascade, is to have the output of one FST read in as the input to a subsequent machine.
So, to handle spelling we use three tapes:
lexical, intermediate and surface
We need one transducer to work between the lexical and intermediate
levels, and a second (a bunch of FSTs) to work between intermediate and
surface levels to patch up the spelling.
lexical:      d o g +N +PL
intermediate: d o g ^ s #
surface:      d o g s
Lexical to Intermediate FST
Orthographic Rules
We need FSTs to map the intermediate level to the surface level.
For each spelling rule we will have an FST, and these FSTs run in parallel.
Some English spelling rules:
consonant doubling -- a single-letter consonant is doubled before -ing/-ed -- beg/begging
E deletion -- silent e is dropped before -ing and -ed -- make/making
E insertion -- e is added after s, z, x, ch, sh before s -- watch/watches
Y replacement -- y changes to ie before -s, and to i before -ed -- try/tries
K insertion -- for verbs ending with vowel + c, we add k -- panic/panicked
We represent these rules using two-level morphology rules:
a => b / c__d
rewrite a as b when it occurs between c and d.
FST for E-Insertion Rule
E-insertion rule: є => e / {x,s,z}^ __ s#
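As a sketch, the e-insertion rule can be applied to the intermediate tape with a regular-expression rewrite, using ^ as the morpheme boundary and # as the word end as on the slide; a real two-level system would compile the rule into an FST, and the function name here is illustrative:

```python
import re

# Sketch of the e-insertion rule: insert e between {x,s,z}^ and s#
# on the intermediate tape, then strip the boundary markers.
def e_insertion(intermediate):
    """fox^s# -> foxes ; cat^s# -> cats (no insertion context)."""
    s = re.sub(r"([xsz])\^s#", r"\1es", intermediate)  # insert the e
    s = s.replace("^", "").replace("#", "")            # clean up markers
    return s

print(e_insertion("fox^s#"))  # foxes
print(e_insertion("cat^s#"))  # cats
```

This accepts exactly the "foxes" case the next slide traces, while leaving forms like "cat^s#" untouched apart from removing the markers.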
Generating or Parsing with FST Lexicon
and Rules
Accepting Foxes
Intersection
We can intersect all rule FSTs to create a single FST.
The intersection algorithm just takes the Cartesian product of states:
For each state qi of the first machine and qj of the second machine, we create a new state qij.
For input symbol a, if the first machine would transition to state qn and the second machine to qm, the new machine transitions to qnm.
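A minimal sketch of this product construction, with toy deterministic machines given as transition dictionaries (the machines and names are illustrative, not the rule FSTs themselves):

```python
# Product construction for intersecting two DFAs over one alphabet:
# state (qi, qj) steps to (qn, qm) when machine 1 steps qi -> qn and
# machine 2 steps qj -> qm on the same symbol.
def intersect_step(d1, d2, state, sym):
    q1, q2 = state
    if (q1, sym) in d1 and (q2, sym) in d2:
        return (d1[(q1, sym)], d2[(q2, sym)])
    return None  # one machine rejects, so the intersection rejects

def intersect_accepts(d1, f1, d2, f2, start, word):
    state = start
    for sym in word:
        state = intersect_step(d1, d2, state, sym)
        if state is None:
            return False
    return state[0] in f1 and state[1] in f2

# machine 1: strings over {a,b} ending in 'a'; machine 2: even length
d1 = {(0, "a"): 1, (0, "b"): 0, (1, "a"): 1, (1, "b"): 0}
d2 = {(0, "a"): 1, (0, "b"): 1, (1, "a"): 0, (1, "b"): 0}
print(intersect_accepts(d1, {1}, d2, {0}, (0, 0), "ba"))  # True
```

A pair state (qi, qj) is accepting exactly when both component states are, so the product accepts the intersection of the two languages.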
Composition
A cascade can turn out to be somewhat of a pain:
it is hard to manage all the tapes
it fails to take advantage of the restricting power of the machines
So, it is better to compile the cascade into a single large machine.
Create a new state (x,y) for every pair of states x є Q1 and y є Q2.
The transition function of the composition is defined as follows:
δ((x,y), i:o) = (v,z) if there exists c such that δ1(x, i:c) = v and δ2(y, c:o) = z
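This transition rule can be sketched directly over arcs stored as tuples; T1 and T2 below are toy single-arc machines for illustration, not the real lexicon or rule FSTs:

```python
# The composition rule over arcs (state, in_sym, out_sym, next_state):
# an i:o arc exists in T1 o T2 between pair states whenever T1 maps
# i to some intermediate symbol c and T2 maps that same c to o.
def compose(arcs1, arcs2):
    return [((x, y), i, o, (v, z))
            for (x, i, c, v) in arcs1
            for (y, c2, o, z) in arcs2
            if c == c2]

T1 = [(0, "+PL", "^s#", 1)]  # lexical -> intermediate (toy)
T2 = [(0, "^s#", "es", 1)]   # intermediate -> surface (toy)
print(compose(T1, T2))       # [((0, 0), '+PL', 'es', (1, 1))]
```

The intermediate symbol c disappears in the result, which is exactly why composing the cascade into one machine lets us drop the intermediate tape.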
Intersect Rule FSTs
lexical tape --LEXICON-FST--> intermediate tape --FST1 … FSTn--> surface tape
=> intersect the parallel rule FSTs into one: FSTR = FST1 ^ … ^ FSTn
Compose Lexicon and Rule FSTs
lexical tape --LEXICON-FST--> intermediate tape --FSTR (= FST1 ^ … ^ FSTn)--> surface tape
=> lexical tape --LEXICON-FST o FSTR--> surface tape
Porter Stemming
Some applications (e.g., some information retrieval applications) do not need the whole morphological processor.
They only need the stem of the word.
A stemming algorithm (such as the Porter stemming algorithm) is a lexicon-free FST.
It is just a series of cascaded rewrite rules.
Stemming algorithms are efficient but they may introduce
errors because they do not use a lexicon.
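A sketch of such a lexicon-free cascade, with a few rules in the spirit of the Porter algorithm (this is not the full Porter rule set):

```python
import re

# A few cascaded suffix-rewrite rules, applied lexicon-free:
# the first matching rule fires and the rest are skipped.
RULES = [
    (r"sses$", "ss"),  # caresses -> caress
    (r"ies$",  "i"),   # ponies   -> poni
    (r"ing$",  ""),    # walking  -> walk (over-simplified)
    (r"ed$",   ""),    # walked   -> walk
    (r"s$",    ""),    # cats     -> cat
]

def stem(word):
    for pattern, repl in RULES:
        new = re.sub(pattern, repl, word)
        if new != word:
            return new  # apply the first matching rule only
    return word

print(stem("walking"), stem("cats"), stem("ponies"))  # walk cat poni
```

Note that stem("sing") wrongly yields "s": with no lexicon to consult, the rules fire on stems that merely look suffixed, which is exactly the kind of error mentioned above.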