Transcript: Lecture 4
CS60057 Speech & Natural Language Processing
Autumn 2005
Lecture 4, 28 July 2005
Morphology
- Study of the rules that govern the combination of morphemes.
- Inflection: same word, different syntactic information
  - run/runs/running, book/books
- Derivation: new word, different meaning
  - Often a different part of speech, but not always
  - possible/possibly/impossible, happy/happiness
- Compounding: new word, each part is a word
  - blackboard, bookshelf
  - lAlakamala, banabAsa
Morphology Level: The Mapping
- Formally: A+ → 2^(L, C1, C2, ..., Cn)
  - A is the alphabet of phonemes (A+ denotes any non-empty sequence of phonemes)
  - L is the set of possible lemmas, uniquely identified
  - Ci are morphological categories, such as:
    - grammatical number, gender, case
    - person, tense, negation, degree of comparison, voice, aspect, ...
    - tone, politeness, ...
    - part of speech (not quite a morphological category, but ...)
  - A, L and the Ci are obviously language-dependent
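The mapping takes a surface string to a set of analyses, since one form may be ambiguous. Below is a minimal Python sketch of this idea under a toy inventory; the dictionary entries and the analyze helper are illustrative assumptions, not part of the lecture.

```python
# Sketch of the mapping A+ -> 2^(L, C1, ..., Cn): each surface form maps to
# a SET of (lemma, categories) analyses, because one form can be ambiguous.
# The toy entries below are illustrative assumptions.
ANALYSES = {
    "books": {("book", "N", "PL"), ("book", "V", "3SG")},
    "ran":   {("run", "V", "PAST")},
    "cats":  {("cat", "N", "PL")},
}

def analyze(surface: str) -> set:
    """Return all (lemma, categories...) analyses of a surface form."""
    return ANALYSES.get(surface, set())

print(analyze("books"))  # ambiguous: plural noun vs. 3rd-singular verb
```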
Bengali/Hindi Inflectional Morphology
- Certain languages encode more grammatical information in morphology than in syntax
- Some of the inflectional suffixes that nouns can take:
  - singular/plural
  - gender
  - possessive markers
  - case markers
    - different Karakas
- Inflectional suffixes that verbs can take:
  - Hindi: tense, aspect, modality, person, gender, number
  - Bengali: tense, aspect, modality, person
- Order among inflectional suffixes (morphotactics)
  - Chhelederke
  - baigulokei
Bengali/Hindi Derivational Morphology
- Derivational morphology in these languages is very rich.
English Inflectional Morphology
- Nouns have simple inflectional morphology.
  - plural -- cat / cats
  - possessive -- John / John's
- Verbs have slightly more complex, but still relatively simple, inflectional morphology.
  - past form -- walk / walked
  - past participle form -- walk / walked
  - gerund -- walk / walking
  - singular third person -- walk / walks
- Verbs can be categorized as:
  - main verbs
  - modal verbs -- can, will, should
  - primary verbs -- be, have, do
- Regular and irregular verbs: walk / walked -- go / went
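As a rough sketch of the regular/irregular split just described, the toy Python function below suffixes regular verbs and looks irregulars up in a small exception table. The names and the (deliberately incomplete) exception list are assumptions for illustration; spelling changes such as make/making are deferred to the orthographic rules discussed later.

```python
# Toy generator for English verb inflection: irregulars come from an
# explicit exception table, regulars get a plain suffix. Spelling
# adjustments (make -> making) are ignored here; orthographic rules
# handle them later in the lecture.
IRREGULAR = {"go": {"PAST": "went", "PASTPART": "gone"}}
SUFFIX = {"PAST": "ed", "PASTPART": "ed", "GERUND": "ing", "3SG": "s"}

def inflect(verb: str, feature: str) -> str:
    if verb in IRREGULAR and feature in IRREGULAR[verb]:
        return IRREGULAR[verb][feature]
    return verb + SUFFIX[feature]

print(inflect("walk", "GERUND"))  # walking
print(inflect("go", "PAST"))      # went
```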
Regulars and Irregulars
- Some words misbehave (refuse to follow the rules)
  - mouse/mice, goose/geese, ox/oxen
  - go/went, fly/flew
- The terms regular and irregular refer to words that follow the rules and those that don't.
Regular and Irregular Verbs
- Regulars:
  - walk, walks, walking, walked, walked
- Irregulars:
  - eat, eats, eating, ate, eaten
  - catch, catches, catching, caught, caught
  - cut, cuts, cutting, cut, cut
Derivational Morphology
- Quasi-systematicity
- Irregular meaning change
- Changes of word class
  - renationalizationability
- Some English derivational affixes:
  - -ation : transport / transportation
  - -er : kill / killer
  - -ness : fuzzy / fuzziness
  - -al : computation / computational
  - -able : break / breakable
  - -less : help / helpless
  - un- : do / undo
  - re- : try / retry
Derivational Examples
- Verb/Adj to Noun
    -ation   computerize   computerization
    -ee      appoint       appointee
    -er      kill          killer
    -ness    fuzzy         fuzziness
Derivational Examples
- Noun/Verb to Adj
    -al      computation   computational
    -able    embrace       embraceable
    -less    clue          clueless
Compute
- Many paths are possible...
- Start with compute:
  - computer -> computerize -> computerization
  - computation -> computational
  - computer -> computerize -> computerizable
  - compute -> computee
Parts of a Morphological Processor
- For a morphological processor, we need at least the following:
  - Lexicon: the list of stems and affixes together with basic information about them, such as their main categories (noun, verb, adjective, ...) and their sub-categories (regular noun, irregular noun, ...).
  - Morphotactics: the model of morpheme ordering that explains which classes of morphemes can follow other classes of morphemes inside a word.
  - Orthographic Rules (Spelling Rules): rules used to model the changes that occur in a word (normally when two morphemes combine).
Lexicon
- A lexicon is a repository for words (stems).
- They are grouped according to their main categories:
  - noun, verb, adjective, adverb, ...
- They may also be divided into sub-categories:
  - regular nouns, irregular-singular nouns, irregular-plural nouns, ...
- The simplest way to create a morphological parser is to put all possible words (together with their inflections) into the lexicon.
- We do not do this because their number is huge (for Turkish, it is theoretically infinite).
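A minimal sketch of such a grouped lexicon in Python, assuming a nested-dictionary layout; the structure and helper are illustrative, and the words are the ones used on the following slides.

```python
# A toy lexicon grouped by main category and sub-category, as described
# above; the layout and the lookup helper are illustrative assumptions.
LEXICON = {
    "noun": {
        "regular":      {"fox", "cat", "dog"},
        "irregular-sg": {"goose", "sheep", "mouse"},
        "irregular-pl": {"geese", "sheep", "mice"},
    },
}

def subcategory(stem: str):
    """Return (category, sub-category) for a stem, or None if unlisted."""
    for cat, subcats in LEXICON.items():
        for sub, stems in subcats.items():
            if stem in stems:
                return (cat, sub)
    return None

print(subcategory("mouse"))  # ('noun', 'irregular-sg')
```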
Morphotactics
- Which morphemes can follow which morphemes.

  Lexicon:
    regular-noun   irreg-pl-noun   irreg-sg-noun   plural
    fox            geese           goose           -s
    cat            sheep           sheep
    dog            mice            mouse

- Simple English Nominal Inflection (Morphotactic Rules)
  - An FSA over morpheme classes, with states 0, 1, 2 (0 initial):
    0 --reg-noun--> 1
    1 --plural (-s)--> 2
    0 --irreg-sg-noun--> 2
    0 --irreg-pl-noun--> 2
  (a code sketch follows)
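The sketch below runs this morphotactic FSA in Python over a pre-segmented sequence of morpheme classes. The state numbering follows the figure; treating both 1 and 2 as accepting is my assumption, so that a bare regular noun like "cat" is accepted.

```python
# The morphotactic FSA above, run over a sequence of morpheme classes
# (the lexicon lookup is assumed to have happened already). Treating
# both states 1 and 2 as accepting is an assumption.
TRANSITIONS = {
    (0, "reg-noun"):      1,
    (1, "plural-s"):      2,
    (0, "irreg-sg-noun"): 2,
    (0, "irreg-pl-noun"): 2,
}
ACCEPTING = {1, 2}

def accepts(classes) -> bool:
    state = 0
    for c in classes:
        if (state, c) not in TRANSITIONS:
            return False
        state = TRANSITIONS[(state, c)]
    return state in ACCEPTING

print(accepts(["reg-noun", "plural-s"]))       # True  (cats)
print(accepts(["irreg-sg-noun", "plural-s"]))  # False (*gooses)
```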
Combine Lexicon and Morphotactics

[Figure: the nominal-inflection FSA with the lexicon spelled out letter by letter -- f-o-x, c-a-t, d-o-g, s-h-e-e-p, g-o-o-s-e, g-e-e-s-e, m-o-u-s-e, m-i-c-e -- followed by an optional -s arc.]
- This only says yes or no; it does not give a lexical representation.
- It also accepts a wrong word (foxs).
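To make the over-acceptance concrete, here is the combined machine approximated with a regular expression built from the lexicon (a sketch standing in for the letter-level FSA, not the FSA itself): a regular stem optionally followed by -s, or an irregular form on its own.

```python
import re

# Regex approximation of the combined lexicon + morphotactics machine.
NOMINAL = re.compile(r"^(?:(?:fox|cat|dog)s?|goose|geese|sheep|mouse|mice)$")

for w in ["cats", "geese", "foxs", "foxes"]:
    print(w, bool(NOMINAL.match(w)))
# cats True; geese True; foxs True (the wrong word is accepted);
# foxes False (needs the e-insertion spelling rule, covered later)
```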
FSAs and the Lexicon
- This will actually require a more powerful kind of FSA: the Finite State Transducer (FST)
- We will give a quick overview
- First we'll capture the morphotactics
  - the rules governing the ordering of affixes in a language
- Then we'll add in the actual words
Simple Rules
Adding the Words
Derivational Rules
Parsing/Generation vs. Recognition
- Recognition is usually not quite what we need.
  - Usually if we find some string in the language, we need to find the structure in it (parsing)
  - Or we have some structure and we want to produce a surface form (production/generation)
- Example
  - From "cats" to "cat +N +PL"
Why care about morphology?
- 'Stemming' in information retrieval
  - Might want to search for "aardvark" and find pages with both "aardvark" and "aardvarks"
- Morphology in machine translation
  - Need to know that the Spanish words quiero and quieres are both related to querer 'want'
- Morphology in spell checking
  - Need to know that misclam and antiundoggingly are not words despite being made up of word parts
Can’t just list all words
- Turkish word Uygarlastiramadiklarimizdanmissinizcasina
  - '(behaving) as if you are among those whom we could not civilize'

    Uygar    'civilized' +
    las      'become' +
    tir      'cause' +
    ama      'not able' +
    dik      'past' +
    lar      'plural' +
    imiz     'p1pl' +
    dan      'abl' +
    mis      'past' +
    siniz    '2pl' +
    casina   'as if'
Finite State Transducers
- The simple story
  - Add another tape
  - Add extra symbols to the transitions
  - On one tape we read "cats", on the other we write "cat +N +PL"
Transitions

    c:c  a:a  t:t  +N:ε  +PL:s

- c:c means read a c on one tape and write a c on the other
- +N:ε means read a +N symbol on one tape and write nothing on the other
- +PL:s means read +PL and write an s
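A minimal Python sketch of these arcs as (input, output) pairs, with "" playing the role of epsilon. The sequence-matching helper is an illustrative assumption, not a general FST engine.

```python
# The cat/cats arcs above as (input, output) pairs.
ARCS = [("c", "c"), ("a", "a"), ("t", "t"), ("+N", ""), ("+PL", "s")]

def generate(lexical_tokens):
    """Read the lexical tape, write the surface tape along the arcs."""
    out = []
    for tok, (inp, outp) in zip(lexical_tokens, ARCS):
        assert tok == inp, f"no arc reads {tok!r}"
        out.append(outp)
    return "".join(out)

print(generate(["c", "a", "t", "+N", "+PL"]))  # -> "cats"
```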
Lexical to Intermediate Level
FST Properties
- FSTs are closed under union, inversion, and composition.
  - union: the union of two regular relations is also a regular relation.
  - inversion: the inversion of an FST simply switches the input and output labels.
    - This means that the same FST can be used in both directions of a morphological processor.
  - composition: if T1 is an FST from I1 to O1 and T2 is an FST from O1 to O2, then the composition of T1 and T2 (T1 ∘ T2) maps from I1 to O2.
- We use these properties of FSTs in the creation of the FST for a morphological processor.
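The closure properties can be made concrete by treating a (finite) transduction as a set of (input, output) string pairs, as in the sketch below. Real FSTs encode infinite relations with states and arcs; this representation is an illustrative assumption.

```python
# Inversion and composition on a finite relation of string pairs.
T1 = {("cat+N+PL", "cat^s#")}   # lexical -> intermediate
T2 = {("cat^s#", "cats")}       # intermediate -> surface

def invert(t):
    """Inversion: swap input and output, so the machine runs 'backwards'."""
    return {(o, i) for (i, o) in t}

def compose(t1, t2):
    """Composition T1 o T2: keep (i, o) whenever a middle string links them."""
    return {(i, o) for (i, m1) in t1 for (m2, o) in t2 if m1 == m2}

print(compose(T1, T2))  # {('cat+N+PL', 'cats')}: lexical straight to surface
print(invert(T2))       # {('cats', 'cat^s#')}: parsing instead of generating
```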
An FST for Simple English Nominals

    after a reg-noun stem:        +N:ε, then +SG:# or +PL:^s#
    after an irreg-sg-noun stem:  +N:ε, then +SG:#
    after an irreg-pl-noun stem:  +N:ε, then +PL:#
FST for stems
- An FST for stems, which maps roots to their root class:

    reg-noun   irreg-pl-noun     irreg-sg-noun
    fox        g o:e o:e s e     goose
    cat        sheep             sheep
    dog        m o:i u:ε s:c e   mouse

- Here fox stands for f:f o:o x:x
- When these two transducers are composed, we have an FST which maps lexical forms to intermediate forms of words for simple English noun inflections.
- The next thing to handle is to design the FSTs for the orthographic rules, and then combine all these transducers.
Multi-Level Multi-Tape Machines
- A frequently used FST idiom, called a cascade, is to have the output of one FST read in as the input to a subsequent machine.
- So, to handle spelling we use three tapes:
  - lexical, intermediate and surface
- We need one transducer to work between the lexical and intermediate levels, and a second (a bunch of FSTs) to work between the intermediate and surface levels to patch up the spelling.

    lexical:      d o g +N +PL
    intermediate: d o g ^ s #
    surface:      d o g s
Lexical to Intermediate FST
Orthographic Rules
- We need FSTs to map the intermediate level to the surface level.
- For each spelling rule we will have an FST, and these FSTs run in parallel.
- Some English spelling rules:
  - consonant doubling -- a 1-letter consonant is doubled before ing/ed -- beg/begging
  - E deletion -- silent e dropped before ing and ed -- make/making
  - E insertion -- e added after s, z, x, ch, sh before s -- watch/watches
  - Y replacement -- y changes to ie before s, and to i before ed -- try/tries
  - K insertion -- for verbs ending in vowel+c, we add k -- panic/panicked
- We represent these rules using two-level morphology rules:
  - a => b / c __ d  ("rewrite a as b when it occurs between c and d")
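The sketch below renders a few of these rules as regular-expression rewrites over an intermediate form "stem^suffix#" (^ = morpheme boundary, # = word end). The patterns are simplifications for illustration (for example, e-insertion here only covers x, s, z, not ch/sh), not the full two-level rules.

```python
import re

# A regex rendering of some of the spelling rules above.
RULES = [
    (r"([xsz])\^s#",     r"\1es#"),   # e-insertion:   fox^s#    -> foxes#
    (r"e\^(ing|ed)#",    r"\1#"),     # e-deletion:    make^ing# -> making#
    (r"([^aeiou])y\^s#", r"\1ies#"),  # y-replacement: try^s#    -> tries#
]

def apply_rules(intermediate: str) -> str:
    for pattern, repl in RULES:
        intermediate = re.sub(pattern, repl, intermediate)
    return intermediate.replace("^", "").replace("#", "")

print(apply_rules("fox^s#"))     # foxes
print(apply_rules("make^ing#"))  # making
print(apply_rules("try^s#"))     # tries
```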
FST for E-Insertion Rule
- E-insertion rule: ε => e / {x, s, z} ^ __ s #
Generating or Parsing with FST Lexicon and Rules
Accepting Foxes
Intersection
- We can intersect all rule FSTs to create a single FST.
- The intersection algorithm just takes the Cartesian product of states:
  - For each state qi of the first machine and qj of the second machine, we create a new state qij.
  - For input symbol a, if the first machine would transition to state qn and the second machine would transition to qm, the new machine transitions to qnm.
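A minimal sketch of this product construction in Python, for DFAs given as (transitions, start, accepting) with transitions keyed on (state, symbol); the representation and the toy automata are illustrative assumptions.

```python
# Product construction: state (qi, qj) moves to (qn, qm) on symbol a.
def intersect(m1, m2):
    t1, s1, f1 = m1
    t2, s2, f2 = m2
    trans = {((qi, qj), a): (qn, qm)
             for (qi, a), qn in t1.items()
             for (qj, b), qm in t2.items() if a == b}
    return trans, (s1, s2), {(x, y) for x in f1 for y in f2}

def run(machine, word):
    trans, state, accepting = machine
    for a in word:
        if (state, a) not in trans:
            return False
        state = trans[(state, a)]
    return state in accepting

# A1: strings over {a,b} ending in 'a'; A2: strings of even length.
A1 = ({(0, "a"): 1, (0, "b"): 0, (1, "a"): 1, (1, "b"): 0}, 0, {1})
A2 = ({(0, "a"): 1, (0, "b"): 1, (1, "a"): 0, (1, "b"): 0}, 0, {0})
print(run(intersect(A1, A2), "ba"))  # True  (even length, ends in a)
print(run(intersect(A1, A2), "a"))   # False (odd length)
```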
Composition
- A cascade can turn out to be somewhat of a pain:
  - it is hard to manage all the tapes
  - it fails to take advantage of the restricting power of the machines
- So, it is better to compile the cascade into a single large machine.
- Create a new state (x, y) for every pair of states x ∈ Q1 and y ∈ Q2. The transition function of the composition is defined as follows:

    δ((x,y), i:o) = (v,z)  if there exists c such that δ1(x, i:c) = v and δ2(y, c:o) = z
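This transition function can be written out directly, as in the sketch below; representing a transducer as a dict from (state, (input, output)) to next state is an assumption for illustration.

```python
# FST composition per the definition above: a composed state is a pair
# (x, y), and an arc i:o exists whenever some middle symbol c lets
# machine 1 go x --i:c--> v and machine 2 go y --c:o--> z.
def compose_delta(d1, d2):
    composed = {}
    for (x, (i, c1)), v in d1.items():
        for (y, (c2, o)), z in d2.items():
            if c1 == c2:
                composed[((x, y), (i, o))] = (v, z)
    return composed
```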
Intersect Rule FSTs

    lexical tape
        |
    LEXICON-FST
        |
    intermediate tape
        |
    FST1 ... FSTn (in parallel)   =>   FSTR = FST1 ^ ... ^ FSTn
        |
    surface tape
Compose Lexicon and Rule FSTs

    lexical tape                        lexical tape
        |                                   |
    LEXICON-FST                  =>    LEXICON-FST o FSTR
        |                                   |
    intermediate tape                   surface tape
        |
    FSTR = FST1 ^ ... ^ FSTn
        |
    surface tape
Porter Stemming
- Some applications (e.g., some information retrieval applications) do not need the whole morphological processor.
- They only need the stem of the word.
- A stemming algorithm (such as the Porter stemming algorithm) is a lexicon-free FST.
- It is just a cascade of rewrite rules.
- Stemming algorithms are efficient, but they may introduce errors because they do not use a lexicon.
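Below is a lexicon-free, Porter-style suffix-stripping cascade in Python. Only a handful of rules are shown and the real algorithm's measure/vowel conditions are omitted, so this is a sketch in the spirit of Porter's algorithm, not the algorithm itself; with no lexicon, it makes exactly the kind of errors mentioned above.

```python
import re

# Porter-style cascade: try each rewrite rule in order, apply the first
# one that changes the word. No lexicon is consulted.
STEP_RULES = [
    (r"sses$", "ss"),  # caresses -> caress
    (r"ies$",  "i"),   # ponies   -> poni
    (r"ss$",   "ss"),  # caress   -> caress
    (r"s$",    ""),    # cats     -> cat
    (r"ing$",  ""),    # motoring -> motor, but also king -> k (an error)
]

def stem(word: str) -> str:
    for pattern, repl in STEP_RULES:
        new = re.sub(pattern, repl, word)
        if new != word:
            return new
    return word

print(stem("caresses"), stem("cats"), stem("king"))  # caress cat k
```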