Morphology from a computational point of view

Download Report

Transcript Morphology from a computational point of view

Morphology from a
computational point of view
March 2001
Today
Minimal Edit Distance, and Viterbi more
generally;
 Letter to Sound
 What is morphology?
 Finite-state automata
 Finite-state phonological rules

1. What is morphology?
Study of the internal structure of words:
 morph-ology word-s jump-ing
Why?
1. For some purposes, we need to know
what the internal pieces are.
2. Knowledge of the words of a language
can’t be summarized in a finite list: we
need to know the principles of wordformation
Some resources
Richard Sproat: Morphology and
Computation (MIT Press, 1992)
 Excellent overview of computational
morphology and phonoloy by Harald
Trost at
http://www.ai.univie.ac.at/~harald/handbo
ok.html

2. What applications need
knowledge of words?
Any high-level linguistic analysis:
syntactic parser
machine translation
speech recognition, text-to-speech (TTS)
information retrieval (IR)
dictionary, spell-checker
3. A list is not enough
An empirical fact:
AP newswire: mid-Feb – Dec 30 1988
Nearly 300,000 words.
“New” words that appeared on Dec 31
1988:
compounds: prenatal-care, publiclyfunded, channel-switching, ownerpresident, logic-loving, part-Vulcan,
signal-emitting, landsite, governmentaligned, armhole, signal-emitting...
...new words...
dumbbells, groveled, fuzzier, oxidized
ex-presidency, puppetry, boulderlike,
over-emphasized, hydrosulfite,
outclassing, non-passengers, racialist,
counterprograms, antiprejudice, reunification, traumatological, refinancings,
instrumenting, ex-critters, mega-lizard
ex-presidency: prefix ex boulder-like: suffix –like
 over-emphasized: prefix over antiprejudice: prefix anti
This is often called the OOV problem
(“out of vocabulary”).

If we work out the principles of wordformation, we will simultaneously:
1. compress the size of our internalized
list of words;
2. become able to deal with new words
on the fly.
4. Overview of applications that
need knowledge of words
Speech generation: text to speech (TTS)
 Speech recognition

Text to speech
Problem: take text, in standard spelling,
and produce a sequence of phonemes
which can be synthesized by the
“backend”.
Severe problems: Proper names (persons,
places), OOV words
boathouse B OW1 T H AU2 S
Speech recognition
Take a sound file (e.g., *.wav) and
produce a list of words in standard
orthography.
Bill Clinton is a recent ex-president.
If someone says it, we need to figure out
what the word was.
Do we know what a word is?
This is actually not an easy question! –
especially if we turn to Asian languages,
without a tradition of putting in “white
space” between “words”, as we do in
the West.
 German writes more compounds
without white space than English does.

Basic principles of morphology
For some purposes, we need to think
about phonemes, while for others it’s
more convenient to talk about letters.
 For our purposes, I’ll talk about letters
whenever we don’t need to specifically
focus on phonemes.

Morpheme

It is convenient to be able to talk about
the pieces into which words may be
broken, and linguists call these pieces
morphemes: the smallest parts of a
language that can be regularly assigned
a meaning.
Morphemes
Uncontroversial morphemes:
door, dog, jump, -ing, -s, to
More controversial morphemes
sing/sang: s-ng + i/a
cut/cut: cut + PAST
Classic distinctions in
morphology:
Analytic (isolating) languages:
– no morphology of derivational or
inflectional sort.
Synthetic (inflecting) languages:
– Agglutinative: 1 function per morpheme
– Fusional: > 1 function per morpheme
Agglutinative:
Finnish Nominal Declension
talo 'the-house'
kaup-pa 'the-shop'
talo-ni 'my house'
kaup-pa-ni 'my shop'
talo-ssa 'in the-house'
shop'
kaup-a-ssa 'in the-
talo-ssa-ni 'in my house’
kaup-a-ssa-ni 'in my shop'
talo-i-ssa 'in the-houses’
kaup-o-i-ssa 'in the-shops'
talo-i-ssa-ni 'in my houses’ kaup-o-i-ssa-ni 'in my shops'
Courtesy of Bucknell Univ. web page
Fusional: Latin
Latin Declension of hortus 'garden'
Singular
Nominative (Subject)
Plural
hort-us
Genitive (of)
hort-i
hort-rum
Dative (for/to)
hort-o
hort-is
Accusative (Direct Obj)
hort-um
hort-us
Vocative (Call)
hort-e
hort-i
Ablative (from/with)
hort-o
hort-is
hort-
Morphemes vs. morphs
Some analysts distinguish between
“morphemes” and “morphs”.
 Morphemes are motivated by an
analysis, and include “plural” and “past”
 Morphs are strings of letters or phones
that “realize” or “manifest” a morpheme.

Free and bound morphemes
Free morphemes can form (freestanding) words; bound morphemes are
only found in combination with other
morphemes.
 Examples?

Functions of morphology
Derivational morphology: creates one
lexeme from another
compute > computer > computerize >
computerization
Inflectional morphology: creates the
form of a lexeme that’s right for a
sentence:
the nominative singular form of a noun; or
the past 3rd person singular form of a
verb.

Word: an identifiable string of letters (or
phonemes) sing
 Word-form: a word with a specific set of
syntactic and morphological features. The
sing in I sing is 1st person sg, and is a distinct
word-from from the sing in you sing.
 Lexeme: a complete set of inflectionally
related word-forms, including sing, sings,
and sang
 Lemma: a complete set of morphologically
related lexemes: sing, sings, song, sang.
A lexeme’s stem
In many languages (unlike English),
constellations of word-forms forming a
lexeme demand the recognition of a
basic stem which does not stand freely
as a word:
Italian
ragazzo, ragazzi (boy, girl)
ragazzi, ragazze (boys, girls)
ragazz-
Compounds
Compounds are composed of 2 (or more)
words or stems
Compounds: hot dog, White House,
bookstore, cherry-covered
Languages vary in the amount of
morphology they have and use
English has a lot of derivational
morphology and relatively little
inflectional morphology
English verb’s inflectional forms:
bare stem, -s, -ed, -ing
European languages
Not uncommon for a verb to have 30 to
50+ forms:
marking tense, person and number of the
subject
Derivation
Derivational morphology usually consists of
adding a prefix or suffix to a base (= stem).
The base has a lexical category (it is a noun,
verb, adjective), and the suffix typically
assigns a different category to the whole word.
Noun
-ness: suffix that takes
an adjective, & makes a noun.
Adj
sad
ness
Adj
Adj
Verb
un interest
ing
Distinct from contractions…
English (and some other languages)
permit the collapsing together of
common words. In some extremely rare
cases, only the collapsed form exists
(English possessive ’s).
He will arrive tonight > he’ll arrive…
The [King of England]’s children
Some basics of English morphology
Inflectional morphology
Nouns: -NULL, -s, -’s
Verbs: -NULL, s, -ed, -ing
(so-called weak verbs)
Strong verbs: 3 major groups
a. Internal verb change (sing/sang,
drive/drove/driven, dive/dove)
b. –t suffix, typically with vowel-shortening
dream/dreamt, sleep/slept
c. –aught replacement: catch, teach, seek,
Derivational morphology in
complex
This morphology creates new words, by
adding prefixes or suffixes.
It is helpful to divide them into two groups,
depending on whether they leave the
pronunciation of the base unchanged or
not.
There are, as always, some fuzzy cases.
Level 1
ize, ization, al, ity, al, ic, al, ity,
ion, y (nominaliz-ing), al,
ate, ous, ive, ation
Can attach to non-word stems
(fratern-al, paternal; parental)
Typically change stress and
vowel quality of stem
Level 2
Never precede Level 1
suffixes
Never change stress
pattern or vowel quality
Almost always attach to
words that already exist
hood, ness, ly, s, ing, ish,
ful, ly, ize, less, y (adj.)
Combinations of Class 1,2
Class 1 + Class 1: histor-ic-al, illuminaat-tion, indetermin-at-y;
 Class 1 + Class 2: frantern-al-ly,
transform-ate-ion-less;
 Class 2 + Class 2: weight-less-ness
 ?? Class 2 + Class 1: *weight-less-ity,
fatal-ism-al

Signature

Set of suffixes (or prefixes) that occurs
in a corpus with a set of stems.
NULL.ed.ing.s
look interest add claim mark extend demand
remain want succeed record offer represent
cover return end explain follow help belong
attempt talk fear happen assault account
point award appeal train contract result
request staff view fail kick visit confront attack
comment sponsor
NULL.s
paper retain improvement missile song truth doctor
indictment window conductor dick misunderstanding
struggle stake tank belief cafeteria material mind
operator bassi lot movement chain notion marriage
dancer scholarship reservoir sweet right battalion hold
mr shot cardinal athletic revenue duel confrontation
solo talent guest shoe russian commitment average
monk election street roger rifle worker area plane
pinch-hitter dozen browning conclusion teacher
narcotic appearance alternative dealer producer mile
stock shrine sometime bag successor career mistake
ankle weapon model front spotlight rhode pace debate
payment requirement fairway consultation chip dollar
employer thank mustang rocket-bomb hat string
precinct robert employee action detective pressure
measure spirit forbid hitter breast yankee partner floor
member
NULL.d.s
increase tie hole associate reserve price fire
receive challenge rate purchase propose
feature celebrate decide suite single change
sculpture combine privilege pledge issue
frame indicate believe damage include use
aide graduate surprise intervene practice
trouble serve oppose promise charge note
schedule continue raise decline cause
operate emphasize relieve hope share judge
birdie produce exchange
NULL.ed.er.ing.s report turn walk park pick flow
NULL.d.ment enforce announce engage arrange
replace improve encourage
NULL.n.s rose low take law drive rise undertake
NULL.al intern profession logic fat tradition extern
margin jurisdiction historic education promotion
constitution addition sensation roy ration origin
classic convention
NULL.man sand news police states gross sun fresh
sports boss sales 3- patrol bonds
ed.er.ing slugg manag crush publish robb
NULL.ity.s major senior moral hospital
NULL.ry hung mason ave summit scene surge rival
forest
NULL.a.s indian kind american
Finite state morphology
FSA: finite-state automata
Consists of
1. a set of states
2. a starting state
3. a set of final or accepting states
4. a finite set of symbols
5. a set of transitions: each is defined by
a from-state, a to-state, and a symbol

It’s natural to think of this as describing
an annotated directed graph.
a
aa
b
q0
q1
aa
q2
!
q3
!
q3
An FSA can be thought of as judging
(accepting) a string, or as generating
one.
 How does it judge? Find a start to finish
path that matches the string.
 How does it generate? Walk through
any start-to-finish path.

Deterministic and Nondeterministic FSAs
Just a little difference:
 Deterministic case: For every state,
there is a maximum of one transition
associated with any given symbol. You
can say that there’s a function from
{states}X{symbols}  {states}
 Nondeterministic case: There is no such
restriction; hence, given a state and a
symbol, it is not necessarily certain
which transition is to be taken.
a
a
b
q0
q1
!
a
q2
deterministic…
q3
q3
a
a
b
q0
q1
!
a
q2
q3
non-deterministic
The best things in life
are non-deterministic.
q3
Figure 3.4 p. 68
unq0
q1
adj-root
q2
-er –est -ly
q3
e
Alternate
  er 


(un)adj roots( est )
  ly 
notation


Yet a third way: rows in an array
(to-column can consist of
pointers)
From
0
0
1
2
To
1
1
2
3
Stop states: 2,3
Output
un
NULL
adj-root-list
er;est;ly
adj-root-1
un-
-er –est -ly
q1
q0
q2
q5
adj-root-1
e
q3
q4
adj-root-2
  er 


(un)adj roots1( est )
  ly 


Figure 3.5 p. 69
-er –est
  er 
adj roots2 ( )
 est 
Yet a third way: rows in an array
(to-column can consist of
Stop states: 2,4,5
pointers)
From
To
Output
0
0
1
2
3
3
4
1
3
2
5
2
4
5
un
NULL
adj-root-list-1
er;est;ly
adj-root-list-1
adj-root-list-2
er;est
0
1
2
2
2
0
0
5
0
7
8
8
0
10
0
11
1
2
3
4
5
5
1
6
7
8
6
9
10
8
11
8
Figure 3.6, p. 70
noun 1
ize
ation
er
able
-al “adj”
adj-al
ity;ness
verbs1
ive “adj”
ness
ly
verb2
ative
noun2
ful
We can simplify greatly (generating a
bit more….)
nouns
noun
ity, ness
ize
ation
er
verbs
verb
ative
ive
able
adjectives
ful
adj
ly
ly
adverbs
adverb
Finite-State Transducers (FST)
The symbols of the FST are complex:
they’re really pairs of symbols, one for
each of two “tapes” or levels.
Recognizer: decides if a given pair of
representations fits together “OK”
Generator: generates pairs of
representations that fit together
Translator: takes a representation on one
level and produces the appropriate
representation on the other level
Finite state transducers
can be inverted, or
 composed
 and you get another FST.

Complex symbols
Usually of the form a:b, which means a
can appear on the upper tape when b
appears on the lower tape.
 So a:b means that’s a permissible
pairing up of symbols.
 “a” along means a:a, etc.
 epsilon e means null character.
 Remember, “other” means “any feasible
pair that is not in this transducer” (p. 78)

Using FSTs for orthographic rules
^:e
#
other
other
Z!
Z! = Z, s, x
q5
Z!
S
^:e
s
e:e
q0
Z!
q1
#, other
^:e
q2
q4
q3
z,x
#, other
 x
 
e  e /  s   __ s #
z
 
#
Using FSTs for orthographic rules
^:e
#
other
other
Z!
Z! = Z, s, x
q5
Z!
S
^:e
s
e:e
q0
Z!
q1
#, other
^:e
q2
q4
q3
z,x
#, other
#
fox^s#…we get to q1 with ‘x’
Using FSTs for orthographic rules
^:e
#
other
other
Z!
Z! = Z, s, x
q5
Z!
S
^:e
s
e:e
q0
Z!
q1
#, other
^:e
q2
q4
q3
z,x
#, other
#
fox^s#…we get to q2 with ‘^’
Using FSTs for orthographic rules
^:e
#
other
other
Z!
Z! = Z, s, x
q5
Z!
S
^:e
s
e:e
q0
Z!
q1
#, other
^:e
q2
q4
q3
z,x
#, other
#
fox^s#…we can get to q3
with ‘NULL’
Using FSTs for orthographic rules
^:e
#
other
other
Z!
Z! = Z, s, x
q5
Z!
S
^:e
s
e:e
q0
Z!
q1
#, other
^:e
q2
q4
q3
z,x
#, other
#
fox^s#…we also get to q5 with ‘s’
but we don’t want to!
So why is this transition there?
?friend^ship, ?fox^s^s (= foxes’s)
^:e
#
other
other
Z!
Z! = Z, s, x
q5
Z!
S
^:e
s
e:e
q0
Z!
q1
#, other
^:e
q2
q4
q3
z,x
#, other
#
fox^s#…we also get to q5 with ‘s’
but we don’t want to!
^:e
#
other
other
Z!
Z! = Z, s, x
q5
Z!
S
^:e
s
e:e
q0
Z!
q1
#, other
^:e
q2
q4
q3
z,x
#, other
#
fox^s#…q4 with s
^:e
#
other
other
Z!
Z! = Z, s, x
q5
Z!
S
^:e
s
e:e
q0
Z!
q1
#, other
^:e
q2
q4
q3
z,x
#, other
#
fox^s#…q0 with #
(accepting state)
Other transitions…
^:e
#
other
other
Z!
Z! = Z, s, x
q5
Z!
S
^:e
s
e:e
q0
Z!
q1
#, other
^:e
q2
q4
q3
z,x
#, other
#
arizona: we leave q0 but return
Other transitions…
^:e
#
other
other
Z!
Z! = Z, s, x
q5
Z!
S
^:e
s
e:e
q0
Z!
q1
#, other
^:e
q2
q4
q3
z,x
#, other
#
miss^s