Search and Decoding in Speech Recognition
Words and Transducers
Veton Këpuska (my.fit.edu), 21 July 2015
Outline
1. Introduction
2. Survey of English Morphology
3. Inflectional Morphology
4. Regular Verbs
5. Derivational Morphology
6. Cliticization
7. Non-Concatenative Morphology
8. Finite State Morphological Parsers
9. Constructing Finite State Lexicon
10. Finite State Transducers
11. FST for Morphological Parsing
12. Transducers and Orthographic Rules
13. Combining FST Lexicon and Rules
14. Lexicon-Free FSTs: The Porter Stemmer
15. Word and Sentence Tokenization
16. Detecting and Correcting Spelling Errors
17. Human Morphological Processing
18. Summary
Introduction
• From Ch. 1 (regular expressions) we saw how easy it is to search for the plural of woodchuck (woodchucks).
• However, searching for the plural of fox, fish, peccary, or wild goose is not as trivial as just tacking on an s.
Main Entry: fox
Pronunciation: 'fäks
Function: noun
Inflected Form(s): plural fox·es also fox
Usage: often attributive
Etymology: Middle English, from Old English; akin to Old High German fuhs fox and perhaps to
Sanskrit puccha tail
Main Entry: fish
Pronunciation: 'fish
Function: noun
Inflected Form(s): plural fish or fish·es
Usage: often attributive
Etymology: Middle English, from Old English fisc; akin to Old High German fisc fish, Latin piscis
Main Entry: pec·ca·ry
Pronunciation: 'pe-k&-rE
Function: noun
Inflected Form(s): plural -ries
Etymology: of Cariban origin; akin to Suriname Carib paki:ra peccary
: any of several largely nocturnal gregarious American mammals resembling the related pigs: as
a : a grizzled animal (Tayassu tajacu) with an indistinct white collar b : a blackish animal
(Tayassu pecari) with a whitish mouth region
Main Entry: goose
Pronunciation: 'güs
Function: noun
Inflected Form(s): plural geese /'gEs/
Etymology: Middle English gos, from Old English gOs; akin to Old High German gans goose, Latin
anser, Greek chEn
Introduction
• Required knowledge to correctly search for singulars and plurals in English:
1. Orthographic rules: words ending in –y are pluralized by changing the –y to –i and adding –es.
2. Morphological rules: tell us that fish has a null plural and that the plural of goose is formed by changing the vowel. (A toy Perl sketch of both kinds of rules follows.)
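A minimal Perl sketch of such rules (the rule list and the irregular table here are illustrative only, not a complete treatment of English plurals):

#!/usr/bin/perl
# Toy pluralizer: one orthographic rule (-y -> -ies), one -es rule
# after sibilants, and a small hand-built table for morphological
# irregulars; everything else just takes -s.
%irregular = ("goose" => "geese", "fish" => "fish", "mouse" => "mice");
sub pluralize {
    my ($noun) = @_;
    return $irregular{$noun} if exists $irregular{$noun};  # null plural, vowel change
    return $1 . "ies"  if $noun =~ /^(.*[^aeiou])y$/;      # butterfly -> butterflies
    return $noun . "es" if $noun =~ /(s|z|sh|ch|x)$/;      # fox -> foxes
    return $noun . "s";                                    # cat -> cats
}
print pluralize($_), "\n" for qw(cat fox butterfly goose fish);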
• Morphological parsing: recognizing that a word (like foxes) breaks down into component morphemes (fox and -es) and building a structured representation of it.
• Parsing means taking an input and producing some sort of linguistic structure for it.
• Parsing can be thought of in broad terms as producing structure from an input string at any of several levels: morphology, syntax, semantics, or discourse; the output structure may be, for example, a tree or a network.
Introduction
• Morphological parsing (or stemming) applies to many affixes other than plurals.
• Example: parsing any English verb ending in –ing (e.g., going, talking, congratulating) into its verbal stem plus the –ing morpheme.
 - going ⇨ VERB-go + GERUND-ing
• Morphological parsing is important for speech and language processing:
 - part-of-speech tagging
 - dictionaries (spell-checking)
 - machine translation
Introduction
• To solve the morphological parsing problem one could just store all the plural forms of English nouns and the –ing forms of English verbs in a dictionary, as is done in English speech recognition tasks.
• For many Natural Language Processing applications this is not possible, because –ing is a productive suffix: it applies to every verb.
• Similarly, –s applies to almost every noun.
• Productive suffixes apply to new words:
 - Example: fax and faxing
 - New words (e.g., acronyms and proper nouns) are created constantly, and each needs the plural morpheme –s.
 - The plural form of a new noun depends on the spelling/pronunciation of the singular form (e.g., for nouns ending in –z, the plural is formed with –es).
• In other languages (e.g., Turkish) one cannot list all the morphological variants of every word:
 - Turkish verbs have 40,000 possible forms, not counting derivational suffixes.
Outline
1. Survey of morphological knowledge for English and some other languages.
2. Introduction of the finite-state transducer as the key algorithm for morphological parsing.
 i. Finite-state transducers are key algorithms for speech and language processing.
3. Related algorithms:
 i. Stemming: mapping from a word to its root or stem. Important for Information Retrieval tasks.
 ii. Need to know if two words have a similar root despite their surface differences.
  a. Example: sang and sung. The word sing is called the common lemma of these words, and mapping from all of these to sing is called lemmatization.
Outline
 iii. Tokenization or word segmentation: a related algorithm, defined as the task of separating out (tokenizing) words from running text.
  a. English text separates words by white space, but:
   - New York, rock ‘n’ roll are considered single words
   - I’m is considered two words, “I” and “am”
4. For many applications we need to know how similar two words are orthographically.
 i. Morphological parsing is one method for computing similarity.
 ii. Another is comparison of strings of letters via the minimum edit distance algorithm (a sketch follows below).
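A sketch of the minimum edit distance computation mentioned in item ii, assuming unit costs for insertion, deletion, and substitution (the general algorithm allows arbitrary costs):

#!/usr/bin/perl
# Minimum edit distance (Levenshtein) with unit costs.
sub min_edit_distance {
    my ($s, $t) = @_;
    my @d;
    $d[$_][0] = $_ for 0 .. length($s);   # deleting all of s
    $d[0][$_] = $_ for 0 .. length($t);   # inserting all of t
    for my $i (1 .. length($s)) {
        for my $j (1 .. length($t)) {
            my $sub = substr($s, $i-1, 1) eq substr($t, $j-1, 1) ? 0 : 1;
            my $min = $d[$i-1][$j] + 1;                                 # deletion
            $min = $d[$i][$j-1] + 1      if $d[$i][$j-1] + 1    < $min; # insertion
            $min = $d[$i-1][$j-1] + $sub if $d[$i-1][$j-1] + $sub < $min; # substitution
            $d[$i][$j] = $min;
        }
    }
    return $d[length($s)][length($t)];
}
print min_edit_distance("sang", "sung"), "\n";      # 1
print min_edit_distance("graffe", "giraffe"), "\n"; # 1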
Survey of English Morphology
 Morphology is the study of the way words
are built up from smaller meaning-bearing
units - morphemes.
 Morpheme is often defined as the minimal
meaning-bearing unit in a language.
 Main Entry: mor·pheme
Pronunciation: 'mor-"fEm
Function: noun
Etymology: French morphème, from Greek
morphE form
: a distinctive collocation of phonemes (as the
free form pin or the bound form -s of pins)
having no smaller meaningful parts
Survey of English Morphology
• Example:
 - fox consists of a single morpheme: fox.
 - cats consists of two morphemes: cat and –s.
• Two broad classes of morphemes:
1. Stems: the main morpheme of a word, and
2. Affixes: add additional meaning to the word.
 i. Prefixes precede the stem: unbuckle
 ii. Suffixes follow the stem: eats
 iii. Infixes are inserted in the stem: humingi (Tagalog, a Philippine language)
 iv. Circumfixes precede and follow the stem: gesagt (German past participle of sagen)
Survey of English Morphology
• A word can have more than one affix:
 - rewrites: prefix re-, stem write, suffix -s
 - unbelievably: prefix un-, stem believe, suffixes -able, -ly
• English does not tend to stack more than four or five affixes.
• Turkish can have words with nine or ten affixes; languages like Turkish are called agglutinative languages.
Main Entry: ag·glu·ti·na·tive
Pronunciation: \ə-ˈglü-tən-ˌā-tiv, -ə-tiv\
Function: adjective
Date: 1634
1 : adhesive
2 : characterized by linguistic agglutination
Survey of English Morphology
• There are many ways to combine morphemes to create a word. Four methods are common and play an important role in speech and language processing:
1. Inflection
 - Combination of a word stem with a grammatical morpheme, usually resulting in a word of the same class as the original stem, and usually filling some syntactic function like agreement.
 - Example: -s (plural of nouns), -ed (past tense of verbs)
Survey of English Morphology
2. Derivation
 - Combination of a word stem with a grammatical morpheme, usually resulting in a word of a different class, often with a meaning hard to predict.
 - Example: computerize (verb), computerization (noun)
3. Compounding
 - Combination of multiple word stems together.
 - Example: doghouse = dog + house
4. Cliticization
 - Combination of a word stem with a clitic. A clitic is a morpheme that acts syntactically like a word but is reduced in form and attached (phonologically and sometimes orthographically) to another word.
 - Example: I’ve = I + ’ve = I + have
Inflectional Morphology
• English has a relatively simple inflectional system; only
 - nouns,
 - verbs, and
 - adjectives (sometimes)
can be inflected, and the number of possible inflectional affixes is quite small.
Inflectional Morphology: Nouns
• Nouns (English) are inflected for:
 - plural
 - possessive
• Many (but not all) nouns can either appear in the bare stem (singular) form or take a plural suffix.

           Regular Nouns       Irregular Nouns
 Singular  cat      thrush     mouse   ox
 Plural    cats     thrushes   mice    oxen
Inflectional Morphology
• Regular plural spelled:
 - -s, or
 - -es after words ending in
  - –s (ibis/ibises)
  - –z (waltz/waltzes)
  - –sh (thrush/thrushes)
  - –ch (finch/finches)
  - –x (box/boxes); sometimes
• Nouns ending in –y preceded by a consonant change the –y to –i (butterfly/butterflies).
• The possessive suffix is realized by apostrophe + -s for
 - regular singular nouns (llama’s), and
 - plural nouns not ending in –s (children’s),
and often by a lone apostrophe after
 - regular plural nouns (llamas’), and some
 - names ending in –s or –z (Euripides’ comedies).
Inflectional Morphology: Verbs
• English inflection of verbs is more complicated than nominal inflection.
1. English has three kinds of verbs:
 i. main verbs (eat, sleep, impeach)
 ii. modal verbs (can, will, should)
 iii. primary verbs (be, have, do)
• We are concerned with main and primary verbs, because these have inflectional endings.
• Of these verbs, a large class are regular (all verbs in this class have the same endings marking the same functions).
Regular Verbs
• Regular verbs have four morphological forms.
• For regular verbs we can predict the other forms from the stem by adding one of three predictable endings and making some regular spelling changes.

 Morphological Form Class      Regularly Inflected Verbs
 Stem                          walk     merge    try      map
 -s form                       walks    merges   tries    maps
 -ing participle               walking  merging  trying   mapping
 Past form or –ed participle   walked   merged   tried    mapped
Regular Verbs
• Since regular verbs
1. cover the majority of verbs and forms, and
2. the regular class is productive,
they are significant in the morphology of English.
• A productive class is one that automatically includes any new words that enter the language.
Irregular Verbs
• Irregular verbs are those that have some more or less idiosyncratic forms of inflection.
• English irregular verbs
 - often have five different forms, but can have
 - as many as eight (e.g., the verb be), or
 - as few as three (e.g., cut or hit).
• They constitute a smaller class of verbs, estimated at about 250.

 Morphological Form Class   Irregularly Inflected Verbs
 Stem                       eat      catch     cut
 -s form                    eats     catches   cuts
 -ing participle            eating   catching  cutting
 Past form                  ate      caught    cut
 –ed participle             eaten    caught    cut
Usage of Morphological Forms for Irregular Verbs
• The –s form:
 - Used in the “habitual present” to distinguish the third-person singular ending (“She jogs every Tuesday”) from the other choices of person and number (“I/you/we/they jog every Tuesday”).
• The stem form:
 - Used in the infinitive, and also after certain other verbs: “I’d rather walk home, I want to walk home”.
• The –ing participle is used in the progressive construction to mark a present or ongoing activity (“It is raining”), or when the verb is treated as a noun; this particular kind of nominal use of a verb is called the gerund: “Fishing is fine if you live near water”.
• The –ed participle is used in the perfect construction (“He’s eaten lunch already”) or the passive construction (“The verdict was overturned yesterday”).
Spelling Changes
• A number of regular spelling changes occur at morpheme boundaries.
• Example:
 - A single consonant letter is doubled before adding the –ing and –ed suffixes: beg/begging/begged
 - If the final letter is “c”, the doubling is spelled “ck”: picnic/picnicking/picnicked
 - If the base ends in a silent –e, it is deleted before adding –ing and –ed: merge/merging/merged
• Just as for nouns, the –s ending is spelled –es after verb stems ending in
 - –s (toss/tosses)
 - –z (waltz/waltzes)
 - –sh (wash/washes)
 - –ch (catch/catches)
 - –x (tax/taxes), sometimes.
• Also like nouns, verbs ending in –y preceded by a consonant change the –y to –i (try/tries).
Derivational Morphology
• Derivation is the combination of a word stem with a grammatical morpheme,
 - usually resulting in a word of a different class,
 - often with a meaning hard to predict exactly.
• English inflection is relatively simple compared to other languages.
• Derivation in English, however, is quite complex.
Derivational Morphology
• A common kind of derivation in English is the formation of new nouns from verbs or adjectives; this is called nominalization.
 - The suffix –ation produces nouns from verbs, often verbs ending in the suffix –ize (computerize → computerization).
• Example:

 Suffix   Base Verb/Adjective   Derived Noun
 -ation   computerize (V)       computerization
 -ee      appoint (V)           appointee
 -er      kill (V)              killer
 -ness    fuzzy (A)             fuzziness
Derivational Morphology
• Adjectives can also be derived from nouns and verbs:

 Suffix   Base Noun/Verb    Derived Adjective
 -al      computation (N)   computational
 -able    embrace (V)       embraceable
 -less    clue (N)          clueless
Complexity of Derivation in English
• There are a number of reasons for the complexity of derivation in English:
1. It is generally less productive:
 - A nominalizing suffix like –ation, which can be added to almost any verb ending in –ize, cannot be added to absolutely every verb.
 - Example: we can’t say *eatation or *spellation (* marks forms that are not legal English words).
2. There are subtle and complex meaning differences among nominalizing suffixes.
 - Example: sincerity vs. sincereness
Cliticization
• clitic, noun (linguistics): a morpheme that functions like a word, but appears not as an independent word, rather always attached to a following or preceding word. In English, the possessive ’s is an example.
• cliticization, noun: the process or an instance of a word becoming a clitic.
Cliticization
• A clitic is a unit whose status lies in between that of an affix and a word.
• Phonological behavior: short, unaccented.
• Syntactic behavior: that of words, acting as pronouns, articles, conjunctions, or verbs.
Cliticization
• Proclitics: clitics preceding a word.
• Enclitics: clitics following a word.

 Full Form  Clitic    Full Form  Clitic
 am         ’m        have       ’ve
 are        ’re       has        ’s
 is         ’s        had        ’d
 will       ’ll       would      ’d
• Ambiguity:
 - She’s → she is or she has (a clitic-expansion sketch follows)
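A sketch of how the unambiguous clitics in this table might be expanded in Perl, leaving the ambiguous ’s and ’d untouched (illustrative only):

#!/usr/bin/perl
# Expand only the unambiguous clitics from the table above.
# 's and 'd are left alone because they are ambiguous
# (she's = she is / she has; he'd = he had / he would).
%clitic = ("'m" => "am", "'re" => "are", "'ve" => "have", "'ll" => "will");
while (<>) {
    s/(\w)('m|'re|'ve|'ll)\b/$1 . " " . $clitic{$2}/ge;
    print;   # "I'm sure we're done" -> "I am sure we are done"
}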
Non-Concatenative Morphology
• The morphology discussed so far is called concatenative morphology.
• Other languages (than English) have extensive non-concatenative morphology:
 - morphemes are combined in more complex ways, as in Tagalog;
 - Arabic, Hebrew, and other Semitic languages exhibit templatic morphology, or root-and-pattern morphology.
Agreement
• In English, plural is marked on both nouns and verbs.
• Consequently the subject noun and the main verb have to agree in number: both must be either singular or plural.
Finite-State Morphological Parsing
• The goal of morphological parsing is to take the input forms in the first column of the following table and produce the output forms in the second column.
• Example, link-grammar: http://www.link.cs.cmu.edu/link/
 Input     Morphologically Parsed Output
 cats      cat+N+PL
 cat       cat+N+SG
 cities    city+N+PL
 geese     goose+N+PL
 goose     goose+N+SG
 goose     goose+V
 gooses    goose+V+1P+SG
 merging   merge+V+PresPart
 caught    catch+V+PastPart
 caught    catch+V+Past
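Before building real transducers, the desired input/output behavior can be mimicked with a plain lookup table; a toy Perl stand-in (entries taken from the table above, with caught deliberately ambiguous):

#!/usr/bin/perl
# Toy "parser": a hash from surface form to all of its parses.
# A real system derives these with an FST instead of listing them.
%parses = (
    "cats"   => ["cat+N+PL"],
    "geese"  => ["goose+N+PL"],
    "goose"  => ["goose+N+SG", "goose+V"],
    "caught" => ["catch+V+PastPart", "catch+V+Past"],
);
for my $w ("cats", "goose", "caught") {
    print "$w -> ", join(" | ", @{ $parses{$w} }), "\n";
}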
Finite-State Morphological Parsing
• The second column of the table on the previous slide contains the stem of each word as well as assorted morphological features. These features provide additional information about the stem.
• Example:
 - +N: the word is a noun
 - +SG: the word is singular
• Some of the input forms may be ambiguous: caught, goose.
• For now we take the goal of morphological parsing to be merely listing all possible parses; the task of disambiguating among morphological parses will be discussed in Chapter 5.
Requirements of Morphological Parser
1. Lexicon: the list of stems and affixes, together with basic information about them.
 - Example: whether a stem is a Noun or a Verb, etc.
2. Morphotactics: the model of morpheme ordering that explains which classes of morphemes can follow other classes of morphemes inside a word.
 - Example: the English plural morpheme follows the noun and does not precede it.
3. Orthographic rules: spelling rules used to model the changes that occur in a word, usually when two morphemes combine.
 - Example: the y → ie spelling rule, as in city + -s → cities.
Links to Morphological Parsers
• PC-KIMMO, a morphological parser
 - Downloads and documentation for the PC-KIMMO morphological parser, as well as background information and research in computational morphology.
 - www.sil.org/pckimmo/
• Hermit Crab
 - Hermit Crab, a morphological parser and generator for classical generative phonology and morphology.
 - www.sil.org/computing/HermitCrab/
• 3.2 Morphological Parsing
 - The goal of morphological parsing is to find out what morphemes a given word is built from. For example, a morphological parser should be able to tell us ...
 - http://cs.union.edu/~striegnk/courses/nlp-with.../node21.html
• NLPUtils Project
 - Here I will be posting the various C# class libraries I have developed for use in natural language processing projects by the CASPR and IHUT projects in the Artificial Intelligence ...
 - www.cs.uga.edu/~wcb/nlp
• CASPR - Computer Analysis of Speech for Psychological Research
 - NLPUtils (Boisclair tokenizer and morphological analyzer): versatile tokenizer and morphological analyzer for English, in C#. Programs and documentation (updated 2008 March 21) ...
 - www.ai.uga.edu/caspr
Links to Morphological Parsers
• Celex readme
 - The CELEX lemma lexicon is the one most similar to an ordinary dictionary, since every entry in this lexicon represents a set of related inflected words.
 - www.ldc.upenn.edu/readme_files/celex.readme.html
• LDC96L14 - LDC Catalog
 - The second release of CELEX contains an enhanced, expanded version of the German lexical database (2.5), featuring approximately 1000 new lemma entries, ...
 - www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId..
Requirements of Morphological Parser
• In the next section we will present:
 - a representation of a simple lexicon for the sub-problem of morphological recognition, and
 - FSAs built to model morphotactic knowledge.
 - The finite-state transducer (FST) is then introduced as a way of modeling morphological features in the lexicon.
Constructing a Finite-State Lexicon
Lexicon
• A lexicon is a repository for words.
 - The simplest possible lexicon would consist of an explicit list of every word of the language, where every word includes
  - abbreviations: AAA
  - proper names: Jane, Beijing, Lydra, Saranda, Zana, Gent, etc.
 - Example: a, AAA, AA, Aachen, aardvark, aardwolf, aba, abaca, aback, …
• For various reasons it is in general inconvenient or even impossible to list every word in a language.
• Computational lexicons are therefore usually structured with
 - a list of each of the stems and affixes of the language, and
 - a representation of the morphotactics.
• One of the most common ways to model morphotactics is with the finite-state automaton.
Example of FSA for English Nominal Inflection
[Figure: FSA for English nominal inflection, with start state q0, states q1 and q2, and arcs labeled reg-noun, irreg-sg-noun, irreg-pl-noun, and plural -s.]
• This FSA assumes that the lexicon includes
 - regular nouns (reg-noun) that take the regular –s plural: cat, dog, aardvark
  - (ignoring for now that the plural of words like fox has an inserted e: foxes), and
 - irregular noun forms that don’t take –s, both
  - singular (irreg-sg-noun): goose, mouse, sheep
  - plural (irreg-pl-noun): geese, mice, sheep
Examples of Noun Inflections

 reg-noun    irreg-pl-noun   irreg-sg-noun   plural
 fox         geese           goose           -s
 cat         sheep           sheep
 aardvark    mice            mouse
Example for English verbal inflection
[Figure: FSA for English verbal inflection, with start state q0 and states q1–q3; arcs from q0 are labeled reg-verb-stem, irreg-verb-stem, and irreg-past-verb-form, and the remaining arcs are labeled past (-ed), past participle (-ed), present participle (-ing), and 3sg (-s).]
• This lexicon has three stem classes:
 - reg-verb-stem
 - irreg-verb-stem
 - irreg-past-verb-form
• and four affix classes:
 - -ed: past
 - -ed: participle
 - -ing: participle
 - -s: third person singular
Examples of Verb Inflections
 reg-verb-stem   irreg-verb-stem   irreg-past-verb   past   past-part   pres-part   3sg
 walk            cut               caught            -ed    -ed         -ing        -s
 fry             speak             ate
 talk            sing              eaten
 impeach                           sang
English Derivational Morphology
• As discussed earlier in this chapter, English derivational morphology is significantly more complex than English inflectional morphology.
• FSAs that model derivational morphology thus tend to be quite complex.
• Some models of English derivation are in fact based on the more complex context-free grammars.
Morphotactics of English Adjectives
• Example of a simple case of derivation, from Antworth (1990):
 - big, bigger, biggest
 - happy, happier, happiest, happily
 - unhappy, unhappier, unhappiest, unhappily
 - clear, clearer, clearest, clearly, unclear, unclearly
 - cool, cooler, coolest, coolly
 - red, redder, reddest
 - real, unreal, really

[Figure: FSA for these adjectives, states q0–q3: an un- arc from q0 to q1, an adj-root arc from q1 to q2, and -er, -est, -ly arcs from q2 to q3, with ε arcs making the prefix and the suffixes optional.]
Problem Issues
• While the FSA on the previous slide will recognize all the adjectives in the table presented earlier, it will also recognize ungrammatical forms like
 - unbig, unfast, oranger, or smally
• To fix this, adj-root would have to be split into classes of adjectives that
1. can occur with un- and –ly: clear, happy, and real
2. cannot: big, small, etc.
• This simple example gives an idea of the complexity to be expected from English derivation.
Derivational Morphology Example 2
• An FSA can model a number of derivational facts, for example:
 - the generalization that any verb ending in –ize can be followed by the nominalizing suffix –ation, and
 - that -al or –able is followed by –ity or –ness.
 - Exercise 3.1 asks you to discover some of the individual exceptions to many of these constructs.
• Example:
 - fossil → fossilize → fossilization
 - equal → equalize → equalization
 - formal → formalize → formalization
 - realize → realizable → realization
 - natural → naturalness
 - casual → casualness

[Figure: FSA for another fragment of English derivational morphology, states q0–q11, with arcs labeled by stem classes (noun_i, adj-al, adj-ous, verb_j, verb_k, noun_l) and by suffixes tagged with their output category: -ize/V, -ation/N, -able/A, -ity/N, -er/N, -ness/N, -ly/Adv, -ive/A, -ative/A, -ful/A.]
Solving the Problem of Morphological Recognition
• Using the FSAs above, one could solve the problem of morphological recognition:
 - given an input string of letters, does it constitute a legitimate English word or not?
• Take the morphotactic FSAs and plug each “sub-lexicon” into the FSA:
 - expand each arc (e.g., the reg-noun arc) with all the morphemes that make up that class.
• The resulting FSA can then be defined at the level of the individual letter.
Solving the Problem of Morphological Recognition
• A noun-recognition FSA can be produced by expanding the nominal inflection FSA of the earlier slide with sample regular and irregular nouns for each class.
• We can use the FSA in the figure below to recognize strings like aardvarks by simply starting at the initial state and comparing the input letter by letter with each word on each outgoing arc, and so on, just as we saw in Ch. 2.
[Figure: the expanded FSA, spelling out f-o-x, c-a-t, a-a-r-d-v-a-r-k, g-o-o-s-e/g-e-e-s-e, s-h-e-e-p, and m-o-u-s-e/m-i-c-e letter by letter, with the plural -s arc.]
Expanded FSA for a few English nouns with their inflection. Note that this automaton will incorrectly accept the input foxs. We will see, beginning on page 62 of the reference book, how to correctly deal with the inserted e in foxes.
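A rough Perl equivalent of plugging the sub-lexicons into the nominal morphotactics, as alternatives in a regular expression; like the FSA in the figure, this sketch wrongly accepts foxs and wrongly rejects foxes, since no orthographic rule has been applied yet:

#!/usr/bin/perl
# Nominal-inflection recognizer: reg-noun (-s)? | irreg-sg | irreg-pl.
# Mirrors the expanded FSA above, including its flaw.
@reg_noun      = qw(fox cat aardvark);
@irreg_sg_noun = qw(goose sheep mouse);
@irreg_pl_noun = qw(geese sheep mice);
$reg   = join "|", @reg_noun;
$irreg = join "|", @irreg_sg_noun, @irreg_pl_noun;
sub accept { $_[0] =~ /^(($reg)s?|$irreg)$/ ? "yes" : "no" }
print "$_: ", accept($_), "\n" for qw(cats geese foxs foxes);
# cats: yes   geese: yes   foxs: yes (wrong)   foxes: no (wrong)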
Finite-State Transducers (FST)
• An FSA can represent the morphotactic structure of a lexicon and thus can be used for word recognition.
• In this section we introduce the finite-state transducer and show how it can be applied to morphological parsing.
• A transducer maps between one representation and another:
 - a finite-state transducer, or FST, is a type of finite automaton which maps between two sets of symbols.
 - An FST can be visualized as a two-tape automaton which recognizes or generates pairs of strings.
 - Intuitively, we can do this by labeling each arc in the FSM (finite-state machine) with two symbol strings, one from each tape.
 - In the figure on the next slide an FST is depicted where each arc is labeled by an input and an output string, separated by a colon.
A Finite-State Transducer (FST)
[Figure: an FST with two states q0 and q1; arcs are labeled with input:output string pairs such as aa:b, b:a, b:ε, b:b, and a:ba.]
• An FST has a more general function than an FSA:
 - where an FSA defines a formal language by defining a set of strings, an FST defines a relation between sets of strings.
 - An FST can be thought of as a machine that reads one string and generates another.
A Finite-State Transducer (FST)
• FST as recognizer:
 - a transducer that takes a pair of strings as input and outputs accept if the string-pair is in the string-pair language, and reject if it is not.
• FST as generator:
 - a machine that outputs pairs of strings of the language; the output is a yes or no, and a pair of output strings.
• FST as translator:
 - a machine that reads a string and outputs another string.
• FST as set relater:
 - a machine that computes relations between sets.
A Finite-State Transducer (FST)
• All four categories of FST above have applications in speech and language processing.
• For morphological parsing (and many other NLP applications) we apply the FST-as-translator metaphor:
 - input: a string of letters
 - output: a string of morphemes
Formal Definition of FST
 Q        A finite set of N states q0, q1, …, qN-1
 Σ        A finite set corresponding to the input alphabet
 Δ        A finite set corresponding to the output alphabet
 q0 ∈ Q   The start state
 F ⊆ Q    The set of final states
 δ(q,w)   The transition function or transition matrix between states. Given a state q ∈ Q and a string w ∈ Σ*, δ(q,w) returns a set of new states Q′ ⊆ Q. δ is thus a function from Q × Σ* to 2^Q (the set of possible subsets of Q). δ returns a set of states rather than a single state because a given input may be ambiguous as to which state it maps to.
 σ(q,w)   The output function giving the set of possible output strings for each state and input. Given a state q ∈ Q and a string w ∈ Σ*, σ(q,w) gives a set of output strings, each a string o ∈ Δ*. σ is thus a function from Q × Σ* to 2^(Δ*).
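The formal tuple translates directly into a data structure; a toy Perl sketch for a one-state transducer that maps every a to b and passes b through (deterministic, so the δ and σ sets are singletons; not a general FST library):

#!/usr/bin/perl
# The FST tuple (Q, Sigma, Delta, q0, F, delta, sigma) as Perl data.
%fst = (
    start => "q0",
    final => { "q0" => 1 },
    delta => { "q0 a" => ["q0"], "q0 b" => ["q0"] },  # next-state sets
    sigma => { "q0 a" => ["b"],  "q0 b" => ["b"] },   # output-string sets
);
sub translate {                  # deterministic case only
    my ($in) = @_;
    my ($q, $out) = ($fst{start}, "");
    for my $c (split //, $in) {
        return undef unless $fst{delta}{"$q $c"};
        $out .= $fst{sigma}{"$q $c"}[0];
        $q = $fst{delta}{"$q $c"}[0];
    }
    return $fst{final}{$q} ? $out : undef;
}
print translate("abba"), "\n";   # prints bbbb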
Properties of FST
• Just as FSAs are isomorphic to regular languages, FSTs are isomorphic to regular relations.
• FSTs and regular relations are closed under union.
• Generally FSTs are not closed under difference, complementation, and intersection.
• In addition to union, FSTs have two closure properties that turn out to be extremely useful:
 - Inversion: the inversion of a transducer T, written T⁻¹, simply switches the input and output labels. Thus if T maps from the input alphabet I to the output alphabet O, T⁻¹ maps from O to I.
 - Composition: if T1 is a transducer from I1 to O1 and T2 a transducer from O1 to O2, then T1∘T2 maps from I1 to O2.
Properties of FST
• Inversion is useful because it makes it easy to convert an FST-as-parser into an FST-as-generator.
• Composition is useful because it allows us to take two transducers that run in series and replace them with one more complex transducer.
• Composition works as in algebra: applying T1∘T2 to an input sequence S is identical to applying T1 to S and then T2 to the resulting sequence; thus T1∘T2(S) = T2(T1(S)). (A toy illustration of both operations follows.)
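For the special case of letter-to-letter transducers, both operations can be sketched with ordinary Perl hashes (a toy illustration under that restriction, not a general FST implementation):

#!/usr/bin/perl
# Toy letter-to-letter transducers as hashes. Inversion swaps
# input and output labels; composition applies T1, then T2.
%T1 = (a => "b");                  # the relation [a:b]
%T2 = (b => "c");                  # the relation [b:c]
%T1_inv = reverse %T1;             # inversion: [b:a]
sub apply {
    my ($T, $s) = @_;
    return join "", map { exists $T->{$_} ? $T->{$_} : $_ } split //, $s;
}
print apply(\%T1, "aaa"), "\n";                # bbb
print apply(\%T1_inv, "bbb"), "\n";            # aaa
print apply(\%T2, apply(\%T1, "aaa")), "\n";   # ccc: T1.T2(S) = T2(T1(S))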
FST Composition Example
[Figure: three one-state transducers, over arcs a:b, b:c, and a:c.]
• The composition of [a:b]+ with [b:c]+ produces [a:c]+.
Projection
• The projection of an FST is the FSA produced by extracting only one side of the relation.
• We refer to the projection to the left or upper side of the relation as the upper or first projection, and the projection to the lower or right side of the relation as the lower or second projection.
Sequential Transducers and Determinism
• Transducers as described so far may be nondeterministic: given an input, there may be many possible output symbols.
 - Thus using general FSTs requires the kinds of search algorithms discussed in Chapter 1, which makes FSTs quite slow in the general case.
 - This implies that it would be nice to have an algorithm to convert a nondeterministic FST to a deterministic one.
 - But while every non-deterministic FSA is equivalent to some deterministic FSA, not all FSTs can be determinized.
Sequential Transducers
• Sequential transducers are a subtype of transducers that are deterministic on their input.
• At any state of a sequential transducer, each given symbol of the input alphabet Σ can label at most one transition out of that state.

[Figure: a sequential transducer with states q0 and q1 and arcs labeled a:b, b:ε, b:b, and a:ba.]
Sequential Transducers
• Sequential transducers are not necessarily sequential on their output. In the FST of the previous slide, two distinct transitions leaving state q0 have the same output (b).
 - The inverse of a sequential transducer may thus not be sequential, so we always need to specify the direction of the transduction when discussing sequentiality.
• Formally, the definition of sequential transducers modifies the δ and σ functions slightly:
 - δ becomes a function from Q × Σ* to Q (rather than to 2^Q), and
 - σ becomes a function from Q × Σ* to Δ* (rather than to 2^(Δ*)).
• The subsequential transducer is a generalization of the sequential transducer which generates an additional output string at the final states, concatenating it onto the output produced so far.
Importance of Sequential and Subsequential Transducers
• Sequential and subsequential FSTs are efficient because they are deterministic on input: they can be processed in time proportional to the number of symbols in the input (linear in the input length), rather than proportional to some much larger number that is a function of the number of states.
• Efficient algorithms for determinization exist for subsequential transducers (Mohri, 1997), as do algorithms for their minimization (Mohri, 2000).
Importance of Sequential and Subsequential Transducers
• While sequential and subsequential FSTs are deterministic and efficient, neither is able to handle ambiguity: they transduce each input string to exactly one possible output string.
• Since ambiguity is a crucial property of natural language, it is useful to have an extension of subsequential transducers that can deal with ambiguity but still retain the efficiency and other useful properties of sequential transducers.
• One such generalization is the p-subsequential transducer:
 - A p-subsequential transducer allows for p (p ≥ 1) final output strings to be associated with each final state (Mohri, 1996).
 - p-subsequential transducers can handle a finite amount of ambiguity, which is useful for many NLP tasks.
2-Subsequential Transducer (Mohri, 1997)
[Figure: a 2-subsequential transducer with states q0–q3, arcs labeled a:a, b:a, and b:b, and the two final output strings a and b at state q3.]
• Mohri (1996, 1997) shows a number of tasks whose
ambiguity can be limited in this way, including the
representation of dictionaries, the compilation of
morphological and phonological rules, and local syntactic
constraints. For each of these kinds of problems, he and
others have shown that they are p-subsequentializable,
and thus can be determinized and minimized. This class of
transducers includes many, although not necessarily all,
morphological rules.
FSTs for Morphological Parsing
Task of Morphological Parsing
• Example: cats → cat +N +PL
• In the finite-state morphology paradigm, we represent a word as a correspondence between:
 - a lexical level, which represents a concatenation of the morphemes making up the word, and
 - the surface level, which represents the concatenation of letters making up the actual spelling of the word.

 Lexical: c a t +N +PL
 Surface: c a t s

Schematic examples of the lexical and surface tapes; the actual transducers will involve intermediate tapes as well.
Lexical Tape
• For finite-state morphology it is convenient to view an FST as having two tapes:
 - the upper or lexical tape, composed of characters from one (input) alphabet Σ, and
 - the lower or surface tape, composed of characters from another (output) alphabet Δ.
• In the two-level morphology of Koskenniemi (1983), each arc is allowed to have only a single symbol from each alphabet.
 - The two alphabets can be combined into a new alphabet Σ′, which makes the relationship to FSAs clear:
 - Σ′ is a finite alphabet of complex symbols, each composed of an input-output pair i:o, where
  - i is one symbol from the input alphabet Σ, and
  - o is one symbol from the output alphabet Δ,
 - so Σ′ ⊆ Σ × Δ.
 - Σ and Δ may each also include the epsilon symbol ε.
Lexical Tape
• Comparing an FSA to an FST for modeling the morphology of a language:
 - An FSA accepts a language stated over a finite alphabet of single symbols, e.g., the sheep language:
  - Σ = {b, a, !}
 - An FST as defined above accepts a language stated over pairs of symbols, as in:
  - Σ′ = {a:a, b:b, !:!, a:!, a:ε, ε:!}
Feasible Pairs
• In two-level morphology, the pairs of symbols in Σ′ are also called feasible pairs.
 - Each feasible pair symbol a:b in the transducer alphabet Σ′ expresses how the symbol a from one tape is mapped to the symbol b on the other tape.
 - Example: a:ε means that an a on the upper tape corresponds to nothing on the lower tape.
• We can write regular expressions in the complex alphabet Σ′ just as in the case of FSAs.
Default Pairs
• Since it is most common for symbols to map to themselves, in two-level morphology we call pairs like a:a default pairs and refer to them by the single letter a.
FST Morphological Parser
• From the morphotactic FSAs covered earlier, by adding a lexical tape and the appropriate morphological features, we can build an FST morphological parser.
• The next slide shows an augmentation of the earlier FSA for English nominal inflection with the nominal morphological features (+Sg and +Pl) that correspond to each morpheme.
 - The symbol ^ indicates a morpheme boundary, while
 - the symbol # indicates a word boundary.
A Schematic Transducer for English Nominal Number Inflection Tnum
[Figure: the schematic transducer Tnum, states q0–q7. From q0, arcs labeled reg-noun, irreg-sg-noun, and irreg-pl-noun lead to q1, q2, and q3; +N:ε arcs lead on to q4, q5, and q6; the final arcs into q7 are +Sg:# and +Pl:^s# for regular nouns, +Sg:# for irregular singulars, and +Pl:# for irregular plurals.]
• The symbols above each arc represent elements of the morphological parse on the lexical tape.
• The symbols below each arc represent the surface tape (or the intermediate tape, to be described later), using the morpheme-boundary symbol ^ and the word-boundary marker #.
• The arcs need to be expanded by the individual words in the lexicon.
Transducer & Lexicon
• In order to use the transducer on the previous slide as a morphological noun parser, it needs to be expanded with all the individual regular and irregular noun stems, replacing the labels reg-noun etc.
• This expansion can be done by updating the lexicon for this transducer, so that irregular plurals like geese will parse into the correct stem goose +N +Pl.
• This is achieved by allowing the lexicon to also have two levels:
 - surface geese maps to lexical goose → new lexical entry: “g:g o:e o:e s:s e:e”.
 - Regular forms are simpler: the two-level entry for fox will now be “f:f o:o x:x”.
 - Relying on the orthographic convention that f stands for f:f and so on, we can simply refer to it as fox and to the form for geese as “g o:e o:e s e”.
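Expanding such two-level entries mechanically into their upper (lexical) and lower (surface) strings is straightforward; a small Perl sketch:

#!/usr/bin/perl
# Expand a two-level lexicon entry into its lexical (upper) and
# surface (lower) strings; a bare symbol f abbreviates f:f.
sub levels {
    my ($entry) = @_;
    my ($up, $down) = ("", "");
    for my $pair (split / /, $entry) {
        my ($u, $d) = $pair =~ /:/ ? split(/:/, $pair) : ($pair, $pair);
        $up .= $u; $down .= $d;
    }
    return ($up, $down);
}
printf "%s : %s\n", levels("g o:e o:e s e");   # goose : geese
printf "%s : %s\n", levels("f o x");           # fox : fox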
Lexicon
 reg-noun    irreg-pl-noun     irreg-sg-noun
 fox         g o:e o:e s e     goose
 cat         sheep             sheep
 aardvark    m o:i u:e s:c e   mouse

[Figure: the expanded FST Tlex, spelling out each stem letter by letter (f o x, c a t, g o:e o:e s e, m o:i u:e s:c e, etc.), followed by the +N:ε arc and the final +Sg:# and +Pl:^s# (regular) or +Pl:# (irregular plural) arcs.]
Expanded FST from Tnum for a few English nouns (Tlex) with their inflection. Note that this automaton will incorrectly accept the input foxs. We will see later how to correctly deal with the inserted e in foxes.
FST
• The resulting transducer, shown on the previous slide, will map:
 - plural nouns into the stem plus the morphological marker +Pl, and
 - singular nouns into the stem plus the morphological marker +Sg.
• Example:
 - cats → cat +N +Pl
 - c:c a:a t:t +N:ε +Pl:^s#
• The output symbols include the morpheme and word boundary markers ^ and # respectively, so the lower labels do not correspond exactly to the surface level.
• Hence the (output) tape with these morpheme boundary markers is referred to as the intermediate tape, as shown in the next figure.
Lexical and Intermediate Tapes
 Lexical:      f o x +N +PL
 Intermediate: f o x ^ s #

A schematic view of the lexical and intermediate tapes.
Transducers and Orthographic Rules
• The method described in the previous section will successfully recognize words like aardvarks and mice.
• However, just concatenating the morphemes won’t work for cases where there is a spelling change:
 - incorrect rejection of input like foxes, and
 - incorrect acceptance of input like foxs.
• This is because English often requires spelling changes at morpheme boundaries, which we handle by introducing spelling (or orthographic) rules.
Notations for Writing Orthographic Rules
• Implementing a rule in a transducer is important in general for speech and language processing.
• The table below introduces a number of rules.

 Name                 Description of Rule                               Example
 Consonant doubling   1-letter consonant doubled before –ing/–ed       beg/begging
 E deletion           Silent –e dropped before –ing and –ed            make/making
 E insertion          –e added after –s, –z, –x, –ch, –sh before –s    watch/watches
 Y replacement        –y changes to –ie before –s, to –i before –ed    try/tries
 K insertion          Verbs ending with vowel + –c add –k              panic/panicked
Lexical → Intermediate → Surface
• An example of the lexical, intermediate, and surface tapes. Between each pair of tapes is a two-level transducer: the lexical transducer between the lexical and intermediate levels, and the E-insertion spelling rule between the intermediate and surface levels. The E-insertion spelling rule inserts an e on the surface tape when the intermediate tape has a morpheme boundary ^ followed by the morpheme -s.

 Lexical:      f o x +N +PL
 Intermediate: f o x ^ s #
 Surface:      f o x e s
Orthographic Rule Example
• The E-insertion rule from the table might be formulated as:
 - Insert an e on the surface tape just when the lexical tape has a morpheme ending in x (or z, etc.) and the next morpheme is –s. Below is a formalization of this rule:

   ε → e / { x, s, z } ^ __ s #
• This rule notation is due to Chomsky and Halle (1968) and is of the form:
 - a → b / c __ d: “rewrite a as b when it occurs between c and d”.
Orthographic Rule Example
• The symbol ε means the empty string; rewriting ε means inserting something.
• Morpheme boundaries ^ are deleted by default on the surface level (^:ε).
• Since the # symbol marks a word boundary, the rule above means:
 - insert an e after a morpheme-final x, s, or z, and before the morpheme s (a Perl sketch of this rewrite follows).
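On intermediate-tape strings written with literal ^ and # markers, the rule (plus the default deletion of the boundary symbols) can be approximated by ordered Perl substitutions; this is a sketch of the rewrite, not the two-level transducer itself:

#!/usr/bin/perl
# E-insertion as a rewrite on the intermediate tape: insert e when
# x, s, or z precedes the morpheme boundary ^ and the morpheme -s
# (then the word boundary #) follows. Order matters: insert first,
# then strip the remaining boundary markers.
while (<>) {
    s/([xsz])\^(s#)/$1e$2/g;   # fox^s#  -> foxes#
    s/\^//g;                   # cat^s#  -> cats#
    s/#//g;                    # strip the word-boundary marker
    print;                     # echo "fox^s#" | ... -> foxes
}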
Transducer for E-insertion rule
[Figure: the E-insertion transducer, with states q0 through q5 and arcs labeled z, s, x, ^:ε, ε:e, #, and other; the transition table below spells it out.]
 State\Input   s:s   x:x   z:z   ^:ε   ε:e   #   other
 q0:           1     1     1     0     -     0   0
 q1:           1     1     1     2     -     0   0
 q2:           5     1     1     0     3     0   0
 q3            4     -     -     -     -     -   -
 q4            -     -     -     -     -     0   -
 q5            1     1     1     2     -     -   0
Combining FST Lexicon and Rules
• We are now ready to combine the lexicon and rule transducers for parsing and generating.
• The figure below depicts the architecture: a two-level cascade of FSTs with the lexicon and the rules.

[Figure: cascade architecture. The lexical tape f o x +N +Pl is input to LEXICON-FST, which produces the intermediate tape f o x ^ s #; this feeds the orthographic-rule transducers FST1 … FSTN, which produce the surface tape f o x e s.]
Tlex FST Combined with Te-insert FST
[Figure: the lexicon transducer Tlex run in cascade with the E-insertion transducer Te-insert. On the lexical tape f o x +N +PL, Tlex passes through states 0 1 2 5 6 7 and writes the intermediate tape f o x ^ s #; Te-insert then passes through states 0 0 0 1 2 3 4 0 and writes the surface tape f o x e s.]
Expanded FST from Tnum for a few English nouns (Tlex) with their inflection. Note that this automaton will incorrectly accept the input foxs; we will see later how to correctly deal with the inserted e in foxes.
Finite-State Transducers
• The exact same cascade with the same state sequences is used when the machine is:
 - generating the surface tape from the lexical tape, or
 - parsing the lexical tape from the surface tape.
• Parsing is slightly more complicated than generation, because of the problem of ambiguity.
 - Example: foxes can also be a verb (meaning “to baffle or confuse”), and hence the lexical parse for foxes could be either:
  - fox +V +3Sg: That trickster foxes me every time, or
  - fox +N +Pl: I saw two foxes yesterday.
Finite-State Transducers (cont)
• For ambiguous cases of this sort the transducer is not capable of deciding.
• Disambiguation requires some external evidence, such as the surrounding words. Disambiguation algorithms are discussed in Ch. 5 and Ch. 20.
• Lacking such external evidence, the best an FST can do is enumerate all possible choices: transducing fox^s# into both fox +V +3Sg and fox +N +Pl.
FSTs and Ambiguity
• However, there is a kind of ambiguity that FSTs do need to handle:
 - local ambiguity that occurs during the process of parsing.
 - Example: parsing the input verb assess.
  - After seeing ass, the E-insertion transducer may propose that the e that follows is inserted by the spelling rule; as far as the FST is concerned, we might have been parsing the word asses.
  - It is not until we fail to see the # after asses, and instead run into another s, that we realize we have gone down an incorrect path.
FSTs and Ambiguity
• Because of this non-determinism, FST-parsing algorithms need to incorporate some sort of search algorithm.
 - Homework: Exercise 3.7 asks you to modify the algorithm for non-deterministic FSA recognition to do FST parsing.
• Note that many possible spurious segmentations of the input, such as parsing assess as ^a^s^ses^s, will be ruled out, since no entry in the lexicon will match this string.
FSTs and Cascading
• Running a cascade of a number of FSTs can be unwieldy:
 - using the cascading property of FSTs, we can compose a single more complex transducer from a cascade of transducers in series.

 unwieldy [uhn-weel-dee], adjective: not wieldy; wielded with difficulty; not readily handled or managed in use or action, as from size, shape, or weight; awkward; ungainly.
FSTs and automaton intersection
• Transducers in parallel can be combined by automaton intersection.
 - The automaton intersection algorithm takes the Cartesian product of the states: for each state qi in machine 1 and state qj in machine 2, we create a new state qij.
 - For any input symbol a, if machine 1 would transition to state qn and machine 2 would transition to state qm, the new transducer transitions to state qnm.
• The figure on the next slide sketches how the intersection (∧) and composition (∘) process might be carried out.
Intersection and Composition of Transducers
[Figure: the rule transducers FST1 … FSTN, run in parallel, are intersected into a single transducer FSTA = (FST1 ∧ FST2 ∧ … ∧ FSTN), which is then composed with the lexicon: LEXICON-FST ∘ FSTA.]
Intersection and Composition of Transducers
• Since there are a number of rule → FST compilers, it is almost never necessary in practice to write an FST by hand.
 - Kaplan and Kay (1994) give the mathematics that define the mapping from rules to two-level relations.
 - Antworth (1990) gives details of the algorithms for rule compilation.
 - Mohri (1997) gives algorithms for transducer minimization and determinization.
Lexicon-Free FSTs: The Porter Stemmer
• Building FSTs from a lexicon plus rules is the standard algorithm for morphological parsing.
• However, there are simpler algorithms that do not require the large on-line lexicon demanded by the standard algorithm.
• These simpler algorithms are especially used in Information Retrieval (IR) tasks like web search, which involve Boolean combinations of relevant keywords or phrases:
 - marsupial OR kangaroo OR koala
• Since a document with the word marsupials might not match the keyword marsupial, some IR systems first run a stemmer on the query and document words.
• Morphological information in IR is thus only used to determine that two words have the same stem; the suffixes are thrown away.
Porter (1980) Stemming Algorithm
• The Porter stemming algorithm is based on a series of simple cascaded rewrite rules.
 - Since cascaded rewrite rules can easily be implemented as an FST, the Porter algorithm can be thought of as a lexicon-free FST stemmer (Homework Exercise 3.6).
• The algorithm contains rules like these:
 - ATIONAL → ATE (e.g., relational → relate)
 - ING → ε if the stem contains a vowel (e.g., motoring → motor)
 - SSES → SS (e.g., grasses → grass)
 More details from:
http://tartarus.org/~martin/PorterStemmer/index.html
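A sketch of these three rules as a cascade of ordered Perl rewrites (an illustrative fragment only; the full algorithm has several ordered steps and "measure" conditions not shown here):

#!/usr/bin/perl
# Three Porter-style rules as cascaded rewrites.
sub stem {
    my ($w) = @_;
    $w =~ s/sses$/ss/;                        # grasses    -> grass
    $w =~ s/ational$/ate/;                    # relational -> relate
    $w =~ s/ing$// if $w =~ /[aeiou].*ing$/;  # motoring   -> motor
    return $w;                                # (sing is untouched: no vowel before -ing)
}
print stem($_), "\n" for qw(grasses relational motoring sing);
# grass relate motor sing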
Porter (1980) Stemming Algorithm
• Not all IR engines use stemming, partly because of stemmer errors such as these noted by Krovetz:

 Errors of Commission          Errors of Omission
 organization → organ          European → Europe
 doing → doe                   analysis → analyzes
 generalization → generic      matrices → matrix
 numerical → numerous          noise → noisy
 policy → police               sparse → sparsity
Word and Sentence Tokenization
• So far we have focused on the problem of segmentation: how words can be segmented into morphemes.
• A related problem is segmenting running text into words and sentences: tokenization.
Word and Sentence Tokenization
• Word tokenization may seem very simple in a language like English that separates words via a special ‘space’ character.
• However, not every language does this, e.g.:
 - Chinese,
 - Japanese, and
 - Thai.
Why Segmentation is a Problem?
1. The whitespace character is not sufficient by itself:
 - Mr. Sherwood said reaction to Sea Containers’ proposal has been “very positive.” In New York Stock Exchange composite trading yesterday, Sea Containers closed at $62.625, up 62.5 cents. “I said, ‘what’re you? Crazy?’ “ said Sadowsky. “I can’t afford to do that.”
 - Segmenting purely on white-space would produce words like these:
  - cents.
  - said,
  - positive.”
  - Crazy?
Why Segmentation is a Problem? (cont.)
2. In speech recognition systems:
 - the whitespace character is not produced by the recognizer;
 - tokens are phonetic labels (not orthographic): AE;
 - phones may be missing (recognizer deletion), spurious (recognizer insertion), or wrong (recognizer substitution).
Segmentation Errors
• Segmentation errors produced by using only white-space can be addressed by treating punctuation, in addition to white-space, as a word boundary.
Segmentation Errors (cont.)
• Punctuation, however, often occurs word-internally:
 - m.p.g., Ph.D., AT&T, cap’n, 01/02/06, google.com, etc.
 - Also, if we want 62.5 in the example on the previous slide to be a word, we must avoid segmenting off every period “.”.
• Number expressions introduce additional problems, whether expressed in numeric form or as words:
 - 777,777.77
 - seven hundred and seventy seven dollars and seventy seven cents
 - 9/15/2007
Segmentation Errors (cont.)
• Languages differ on punctuation styles for numbers and dates:
 - European continental languages use a comma to mark the decimal “point”, and spaces (or sometimes periods) where English puts commas:
  - 777 777,77 or 777.777,77, and
  - 15.09.2007
Other Tokenizer Tasks
• Expansion of clitic contractions that are marked by apostrophes:
 - what’re → what are
 - we’re → we are
Other Tokenizer Tasks (cont.)
• Complications:
 - Apostrophes are ambiguous, since they are used both as
  - genitive markers: “the book’s cover”, or “Containers’ ” in the earlier example in the Segmentation Problem slide, and as
  - quotative markers: “ ‘what’re you? Crazy?’ “
Other Tokenizer Tasks
• Multi-word expressions:
 - New York
 - rock ‘n’ roll
• This requires a multiword expression dictionary; a tokenizer must also detect names, dates, and organization names, a task called named entity detection (Ch. 22).
Other Tokenizer Tasks (cont.)
• Sentence segmentation is a crucial first step in text processing.
 - Segmentation of a text into sentences is generally based on punctuation.
 - Sentence boundaries are usually marked with “.”, “?”, or “!”.
 - “?” and “!” are relatively unambiguous compared to “.”
  - Example: sentence-boundary “.” vs. “Mr.” or “Inc.”; the earlier example used “Inc.” to mark an abbreviation as well as the sentence boundary.
 - Consequently, word tokenization and sentence tokenization tend to be addressed jointly.
Sentence Tokenization
• Typically, sentence tokenization methods work by building a binary classifier:
 - based on a sequence of rules, or
 - based on some machine learning algorithm (introduced in subsequent chapters),
deciding whether a period is part of the word or is a sentence boundary marker.
 - An abbreviation dictionary is useful in determining whether a period is attached to a commonly used abbreviation.
 - A useful first step in sentence tokenization can be taken via a sequence of regular expressions. The first part of this algorithm, word tokenization, is presented on the next slide, implemented as a Perl script.
• A free scripting language can be downloaded from: http://aspn.activestate.com/ASPN/Downloads/ActivePerl/
Word Tokenization with Perl
#!/usr/bin/perl
$letternumber = "[A-Za-z0-9]";
$notletter = "[^A-Za-z0-9]";
$alwayssep = "[\\?!()\";/\\|`]";
$clitic = "('|:|-|'S|'D|'M|'LL|'RE|'VE|N'T|'s|'d|'m|'ll|'re|'ve|n't)";

$abbr{"Co."} = 1; $abbr{"Dr."} = 1;
$abbr{"Jan."} = 1; $abbr{"Feb."} = 1;

while (<>) {
    # put whitespace around unambiguous separators
    s/$alwayssep/ $& /g;

    # put whitespace around commas that aren't inside numbers
    s/([^0-9]),/$1 , /g;
    s/,([^0-9])/ , $1/g;

    # distinguish single quotes from apostrophes by
    # segmenting off single quotes not preceded by a letter
    s/^'/$& /g;
    s/($notletter)'/$1 '/g;

    # segment off unambiguous word-final clitics and punctuation
    s/$clitic$/ $&/g;
    s/$clitic($notletter)/ $1 $2/g;

    # now deal with periods. For each possible word:
    @possiblewords = split(/\s+/, $_);
    foreach $word (@possiblewords) {
        # if it ends in a period,
        if (($word =~ /$letternumber\./)
            && !($abbr{$word})   # and isn't on the abbreviation list
            # and isn't a sequence of letters and periods (U.S.)
            # and doesn't resemble an abbreviation (no vowels: Inc.)
            && !($word =~ /^([A-Za-z]\.([A-Za-z]\.)+|[A-Z][bcdfghj-nptvxz]+\.)$/)) {
            # then segment off the period
            $word =~ s/\.$/ \./;
        }
        # expand clitics
        $word =~ s/'ve/have/;
        $word =~ s/'m/am/;
        print $word, " ";
    }
    print "\n";
}
Software Tools
• Perl sentence parsing module Sentence.pm
 - http://search.cpan.org/~shlomoy/Lingua-EN-Sentence-0.25/lib/Lingua/EN/Sentence.pm
• Finite State Morphology
 - Book: http://www.stanford.edu/~laurik/fsmbook/home.html
 - Software: http://www.stanford.edu/~laurik/.book2software/
Sentence Tokenization
• The fact that a simple tokenizer can be built with such simple regular
  expression patterns as those in the preceding Perl example suggests that it can
  easily be implemented as an FST.
• Examples of FST-based tokenizers that have been built:
  • Karttunen et al. (1996) and
  • Beesley and Karttunen (2003) give descriptions of such FST-based
    tokenizers.
Detecting and Correcting Spelling Errors
• The problem of detecting and correcting spelling errors is introduced.
• Since the standard algorithm for spelling-error correction is probabilistic, we
  will continue our spell-checking discussion in Ch. 5 after the probabilistic noisy
  channel is introduced.
• The detection and correction of spelling errors is an integral part of
  • modern word processors and search engines,
  • optical character recognition (OCR) – the automatic recognition of machine-
    or hand-printed characters, and
  • on-line handwriting recognition of printed or cursive handwriting.
Spell-Checking Problems in Increasing Order of Broadness
• Non-word error detection: detecting spelling errors that result in non-words
  (e.g., graffe for giraffe).
• Isolated-word error correction: correcting spelling errors that result in
  non-words (e.g., correcting graffe to giraffe), looking only at the word in
  isolation.
• Context-dependent error detection and correction: using the context to help
  detect and correct spelling errors even if they accidentally result in an actual
  word of English (real-word errors). This happens either from typographical
  errors (insertion, deletion, transposition) that accidentally produce a real word
  (e.g., there for three), or because the writer substituted the wrong spelling of a
  homophone or near-homophone (e.g., dessert for desert, or piece for peace).
Non-Word Error Detection
• Marking any word that is not found in a dictionary.
  • The dictionary would not have a word entry graffe.
• Early research (Peterson, 1986) suggested that such spelling dictionaries would
  need to be kept small:
  • large dictionaries contain rare words that resemble misspellings of other
    words.
  • Example: the rare words wont and veery can hide misspellings of won’t and
    very.
• A larger dictionary proved more help than harm by avoiding marking rare
  words as errors, in spite of the fact that some misspellings were hidden by real
  words (Damerau and Mays, 1989).
• Modern spell-checking systems tend to be based on large dictionaries.
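A minimal sketch of such dictionary-based non-word detection (the word-list path and the lowercasing normalization are illustrative assumptions):

  #!/usr/bin/perl
  # Sketch: flag tokens that are not found in a dictionary word list.
  use strict; use warnings;
  my %dict;
  open(my $fh, "<", "/usr/share/dict/words") or die "cannot open word list: $!";
  while (<$fh>) { chomp; $dict{lc $_} = 1; }
  close $fh;
  while (my $line = <STDIN>) {
      for my $token ($line =~ /([A-Za-z']+)/g) {
          print "possible non-word: $token\n" unless $dict{lc $token};
      }
  }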
Dictionary Implementation
• The finite-state morphological parsers described throughout this chapter
  provide a technology for implementing such large dictionaries.
  • By giving a morphological parse for a word, an FST parser is inherently a
    word recognizer.
  • An FST morphological parser can be turned into an even more efficient FSA
    word recognizer by using the projection operation to extract the lower-side
    language graph.
• Such FST dictionaries also have the advantage of representing productive
  morphology:
  • the English –s and –ed inflections described previously.
  • Important for dealing with new legitimate combinations of stems and
    inflections:
    • a new stem can be added to the dictionary, and then all its inflected
      forms are easily recognized.
    • This makes FST dictionaries especially powerful for spell-checking in
      morphologically rich languages, where a single stem can have tens or
      hundreds of possible surface forms.
Dictionary Implementation
• FST dictionaries can help with non-word error detection. But what about error
  correction?
• Algorithms for isolated-word error correction operate by finding words that are
  the likely source of the erroneous form:
  • Correcting the spelling error graffe requires:
    • searching through all possible options, like giraffe, graph, craft, grail,
      etc.
    • selecting the best choice from all possible ones:
      • computation of a distance metric between the source and the surface
        error.
  • Intuitively, giraffe is a more likely source than grail for graffe.
• The most powerful way to capture this similarity intuition requires the use of
  probability theory and will be discussed in Ch. 5. However, that algorithm is
  based on the non-probabilistic minimum edit distance algorithm, which is
  introduced next.
Minimum Edit Distance
• Deciding which of two words is closer to some third word in spelling is a special
  case of the general problem of string distance.
• The distance between two strings is a measure of how alike they are.
• The class of algorithms that solves this problem is known as minimum edit
  distance.
• The minimum edit distance between two strings is the minimum number of
  editing operations:
  • insertions,
  • deletions,
  • substitutions
  needed to transform one string into the other.
Minimum Edit Distance
  I  N  T  E  *  N  T  I  O  N
  |  |  |  |  |  |  |  |  |  |
  *  E  X  E  C  U  T  I  O  N
  D  S  S        I  S
• Representing the minimum edit distance between two strings as an alignment.
  The alignment is achieved with operations of
  • deletion (D),
  • insertion (I), and
  • substitution (S).
• To get a numeric representation for minimum edit distance, we can assign a
  particular cost or weight to each of these operations.
• Levenshtein distance between two sequences is the simplest weighting, in
  which each of the three operations has a cost of 1 (Levenshtein, 1966).
  • In the example above, the distance between intention and execution is 5.
  • Another version weights D and I with a cost of 1 and S with a cost of 2,
    making the distance above equal to 8.
Minimum Edit Distance
• The minimum edit distance is computed by dynamic programming.
• Dynamic programming is the name for a class of algorithms first introduced by
  Bellman (1957), more than 50 years ago.
• It applies a table-driven method to solve problems by combining solutions to
  sub-problems.
• This class includes the most commonly used algorithms in speech and language
  processing, such as:
  • Viterbi and Forward (Ch. 6),
  • CYK and Earley (Ch. 13).
Distance Matrix
[Figure: the intention/execution alignment from the previous slide, shown above
an empty distance matrix with the source intention on the vertical axis (from #
at the bottom) and the target execution on the horizontal axis (#, E, X, E, C, U,
T, I, O, N).]
DP Algorithm Details
• A character-by-character matrix illustrates the alignment process: a cell (i, j)
  can be reached from (i-1, j), (i, j-1), or (i-1, j-1).
[Figure: a char-char matrix comparing a test string against a reference string,
with a cell (i, j) and its predecessor cells (i-1, j), (i, j-1), and (i-1, j-1)
marked.]
DP Algorithm Details
• One way to find the path that yields the best match (i.e., the lowest global
  distance) between the test string and a given reference string is by evaluating
  all possible paths in the matrix.
• That approach is time consuming because the number of possible paths grows
  exponentially with the length of the input string.
• The matching process can be shortened by requiring that
  1. a path cannot go backwards (i.e., to the left or down in the figure);
  2. a path must include an association for every character in the test string; and
  3. local distance scores are combined by adding to give a global distance.
DTW - DP Algorithm Details
• For the moment, assume that a path must include an association for every
  frame in the template and every frame in the utterance.
• Then, for a point (i, j) in the time-time matrix (where i indexes the utterance
  frame and j the template frame), the previous point must have been (i-1, j-1),
  (i-1, j), or (i, j-1) (because of the prohibition on going backward in time).
[Figure: point (i, j) in the time-time matrix, reachable only from (i-1, j),
(i, j-1), or (i-1, j-1).]
DTW - DP Algorithm Details
• The principle of dynamic programming (DP) is that the next selected point on
  the path, (i, j), comes from whichever of (i-1, j-1), (i-1, j), or (i, j-1) has the
  lowest global distance.
• DP finds the lowest-distance path through the matrix while minimizing the
  amount of computation.
• The DP algorithm operates in a time-synchronous manner by considering each
  column of the time-time matrix in succession (which is equivalent to processing
  the utterance frame by frame).
• For a template of length N (corresponding to an N-row matrix), the maximum
  number of paths being considered at any time is N.
• A test utterance feature vector j is compared to all reference template features
  1…N, thus generating a vector of corresponding local distances d(1, j),
  d(2, j), …, d(N, j).
DP Algorithm Details
• If D(i, j) is the global distance up to, but not including, matrix point (i, j), and
  the local distance of matching character i of the test string with character j of
  the reference string is given by d(i, j), then

  D(i, j) = min[D(i-1, j-1), D(i-1, j), D(i, j-1)] + d(i, j)

• Given that D(1, 1) = d(1, 1) (the initial condition), we have the basis for an
  efficient recursive algorithm for computing D(i, j).
• The final global distance D(M, N) at the end of the path gives the overall
  lowest matching score of the template with the utterance, where M is the
  number of vectors of the utterance.
• For spell-checking, the test string is then matched against each dictionary
  word, and the word with the lowest matching score yields the minimum edit
  distance.
DP Algorithm Details
• The equation presented in the previous slide enforces the rule that the only
  directions in which a path can move when at (i, j) in the time-time matrix are:
  • up,
  • right, or
  • diagonally up and right.
• Computationally, that equation is in a form that could be programmed
  recursively. However, unless the language is optimized for recursion, this
  method can be slow even for relatively small pattern sizes.
• Another method that is both quicker and requires less memory uses two nested
  "for" loops. This method only needs two arrays that hold adjacent columns of
  the time-time matrix.
DP Algorithm Details
• Referring to the next figure, the algorithm to find the least global distance path
  is as follows. (Note from the figure, which shows a representative set of rows
  and columns, that the cells at (i, j) and (i, 0) have different possible originator
  cells: the path to (i, 0) can originate only from (i-1, 0), but the path to any
  other (i, j) can originate from the three standard locations.)
[Figure: the previous column (i-1) and the current column (i), with cell (i, j)
reachable from (i-1, j), (i-1, j-1), and (i, j-1), and cell (i, 0) reachable only
from (i-1, 0).]
DP Algorithm Details
1. Calculate the global distance for the bottom-most cell of the left-most column,
   column 0.
   • The global distance up to this cell is just its local distance.
   • The calculation then proceeds upward in column 0: the global distance at
     each successive cell is the local distance for that cell plus the global
     distance to the cell below it.
   • Column 0 is then designated the predCol (predecessor column).
2. Calculate the global distance to the bottom-most cell of the next column,
   column 1 (which is designated the curCol, for current column).
   • The global distance to that bottom-most cell is the local distance for that
     cell plus the global distance to the bottom-most cell of the predecessor
     column.
3. Calculate the global distance of the rest of the cells of curCol.
   • For example, at cell (i, j) this is the local distance at (i, j) plus the minimum
     global distance at either (i-1, j), (i-1, j-1), or (i, j-1).
4. curCol becomes predCol, and step 2 is repeated until all columns have been
   calculated.
5. The minimum global distance is the value stored in the top-most cell of the
   last column.
DP Algorithm Pseudo Code
• The pseudocode for this process is:

  Calculate first-column distances (prev. column)
  for i = 1 to number of input feature vectors
      CurCol[0] = local distance at (i,0) + global cost at (i-1,0)
      for j = 1 to number of template feature vectors
          CurCol[j] = local distance at (i,j)
                      + minimal global cost at (i-1,j), (i-1,j-1), (i,j-1)
      end
  end
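A runnable sketch of this two-column scheme applied to strings (a minimal illustration under unit costs, where the local distance is 0 for matching characters and 1 otherwise; the names are ours, not from the text):

  #!/usr/bin/perl
  # Sketch: minimum edit distance with two rolling columns (unit costs).
  use strict; use warnings;
  use List::Util qw(min);

  sub edit_distance {
      my ($source, $target) = @_;
      my @s = split //, $source;
      my @t = split //, $target;
      # prevCol[j]: distance from the first i-1 source chars to the first j target chars
      my @prevCol = (0 .. scalar @t);
      for my $i (1 .. @s) {
          my @curCol = ($i);                       # matching against the empty target
          for my $j (1 .. @t) {
              my $subst = ($s[$i-1] eq $t[$j-1]) ? 0 : 1;
              $curCol[$j] = min($prevCol[$j] + 1,          # delete from source
                                $curCol[$j-1] + 1,         # insert into source
                                $prevCol[$j-1] + $subst);  # substitute or match
          }
          @prevCol = @curCol;                      # curCol becomes predCol
      }
      return $prevCol[-1];
  }

  print edit_distance("intention", "execution"), "\n";     # prints 5 (unit costs)

Only two columns are ever stored, matching the memory-saving point made above.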
DP Algorithm Details
• To perform recognition on an utterance, the algorithm is repeated for each
  template.
• The template file that gives the lowest global matching score is picked as the
  most likely word.
• Note that the minimum global matching score for the test word needs to be
  compared against all dictionary candidates.
• The candidate words can then be sorted based on minimum edit distance (DP
  score).
DP Algorithm Details
• The same recurrence, written with explicit insertion, deletion, and substitution
  costs:

  D(i, j) = min( D(i-1, j)   + insertion_cost(target[i-1]),
                 D(i-1, j-1) + substitution_cost(source[j-1], target[i-1]),
                 D(i, j-1)   + deletion_cost(source[j-1]) )
Distance Computation
function MIN-EDIT-DISTANCE(target, source) returns min-distance
    n ← LENGTH(target)
    m ← LENGTH(source)
    Create a distance matrix distance[n+1, m+1]
    Initialize the zeroth row and column to be the distance from the empty string:
        distance[0,0] = 0
        for each column i from 1 to n do
            distance[i,0] ← distance[i-1,0] + ins-cost(target[i])
        for each row j from 1 to m do
            distance[0,j] ← distance[0,j-1] + del-cost(source[j])
    for each column i from 1 to n do
        for each row j from 1 to m do
            distance[i,j] ← MIN( distance[i-1,j] + ins-cost(target[i-1]),
                                 distance[i-1,j-1] + subst-cost(source[j-1], target[i-1]),
                                 distance[i,j-1] + del-cost(source[j-1]) )
    return distance[n,m]
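A runnable Perl transcription of this pseudocode (a sketch under the cost assumptions used in the matrix example that follows: ins-cost = del-cost = 1, subst-cost = 2 for differing letters and 0 otherwise):

  #!/usr/bin/perl
  # Sketch: full-matrix minimum edit distance (ins/del cost 1, subst cost 2).
  use strict; use warnings;
  use List::Util qw(min);

  sub min_edit_distance {
      my ($target, $source) = @_;
      my @t = split //, $target;
      my @s = split //, $source;
      my @d;
      $d[0][0] = 0;
      $d[$_][0] = $d[$_ - 1][0] + 1 for 1 .. @t;   # zeroth column: insertions
      $d[0][$_] = $d[0][$_ - 1] + 1 for 1 .. @s;   # zeroth row: deletions
      for my $i (1 .. @t) {
          for my $j (1 .. @s) {
              my $subst = ($t[$i-1] eq $s[$j-1]) ? 0 : 2;
              $d[$i][$j] = min($d[$i-1][$j] + 1,           # insertion
                               $d[$i-1][$j-1] + $subst,    # substitution or match
                               $d[$i][$j-1] + 1);          # deletion
          }
      }
      return $d[@t][@s];
  }

  print min_edit_distance("execution", "intention"), "\n"; # prints 8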
Minimum Edit Distance Algorithm
• The minimum edit distance algorithm, an example of the class of dynamic
  programming algorithms, is shown in the previous slide.
• The various costs can either be
  • fixed (e.g., ∀x, ins-cost(x) = 1), or
  • specific to the letter (to model the fact that some letters are more likely to
    be inserted than others).
• We assume that there is no cost for substituting a letter for itself (i.e.,
  subst-cost(x, x) = 0).
Example of Distance Matrix Computation
(Cost = 1 for INS & DEL, Cost = 2 for SUB)
  N   9   8   9  10  11  12  11  10   9   8
  O   8   7   8   9  10  11  10   9   8   9
  I   7   6   7   8   9  10   9   8   9  10
  T   6   5   6   7   8   9   8   9  10  11
  N   5   4   5   6   7   8   9  10  11  10
  E   4   3   4   5   6   7   8   9  10   9
  T   3   4   5   6   7   8   7   8   9   8
  N   2   3   4   5   6   7   8   7   8   7
  I   1   2   3   4   5   6   7   6   7   8
  #   0   1   2   3   4   5   6   7   8   9
      #   E   X   E   C   U   T   I   O   N
Alignment
• The DP algorithm can also be used to find the minimum-cost alignment between
  two strings. This is useful in
  • speech and language processing,
  • speech recognition, and
  • machine translation.
Alignment Path
[Figure: the distance matrix with backtrace arrows showing the minimum-cost
alignment path; the path cells appear in the next slide.]
Example of Distance Matrix Computation
(Cost = 1 for INS & DEL, Cost = 2 for SUB)
  N   9   8   9  10  11  12  11  10   9  (8)
  O   8   7   8   9  10  11  10   9  (8)  9
  I   7   6   7   8   9  10   9  (8)  9  10
  T   6   5   6   7   8   9  (8)  9  10  11
  N   5   4   5   6   7  (8)  9  10  11  10
  E   4  (3) (4) (5) (6)  7   8   9  10   9
  T  (3)  4   5   6   7   8   7   8   9   8
  N  (2)  3   4   5   6   7   8   7   8   7
  I  (1)  2   3   4   5   6   7   6   7   8
  # (0)   1   2   3   4   5   6   7   8   9
      #   E   X   E   C   U   T   I   O   N

• Cells on one minimum-cost alignment path are shown in parentheses: delete i,
  n, and t; match e; insert x, e, and c; substitute n with u; then match t, i, o, n
  (total cost 3·1 + 3·1 + 2 = 8).
Back-tracking/Backtrace
• Requires only a minimal augmentation of the original algorithm: store in each
  cell of the matrix the information about which cell the minimal score was
  derived from.
• Homework: Exercise 3.12. Write a complete DP algorithm to match two strings
  by finding the minimal edit distance and the best matching path. You can use:
  • Matlab,
  • C++,
  • C,
  • C#,
  • Perl,
  • PHP, …
Publicly Available Packages
• UNIX diff (Win32 Cygwin package):
  • http://cygwin.com/
• sclite and other software tools for speech & language processing from NIST:
  • http://www.nist.gov/speech/tools/index.htm
Human Morphological Processing
• A brief survey of psycholinguistic studies on how multi-morphemic words are
  represented in the minds of speakers of English.
• Example:
  • walk → walks and walked
  • happy → happily and happiness
  • Are all three forms stored in the human lexicon? Or
  • walk plus –ed and –s?
  • happy plus –ly and –ness, with the y → i spelling change?
Human Morphological Processing
• Two extreme possibilities:
  1. Full listing – proposes that all words of a language are listed in the mental
     lexicon without any internal morphological structure.
     • This hypothesis is certainly untenable for morphologically complex
       languages like Turkish.
  2. Minimum redundancy – suggests that only the constituent morphemes are
     represented in the lexicon, and that when processing walks (whether for
     reading, listening, or talking) we must access both morphemes (walk and
     –s) and combine them.
Evidence of Human Lexicon
• The earliest evidence comes from speech errors, also called slips of the tongue.
• In conversational speech, speakers often mix up the order of words or sounds:
  • If you break it it’ll drop.
• In the slips of the tongue collected by Fromkin and Ratner (1998) and Garrett
  (1975), inflectional and derivational affixes can appear separately from their
  stems:
  • It’s not only us who have screw looses (for “screws loose”)
  • Words of rule formation (for “rules of word formation”)
  • Easy enoughly (for “easily enough”)
• This suggests that the mental lexicon contains some representation of
  morphological structure.
Evidence of Human Lexicon
• More recent experimental evidence suggests that neither the full listing nor the
  minimum redundancy hypothesis may be completely true.
• Stanners et al. found evidence that:
  • derived words like happily and happiness are stored separately from their
    stem happy, while
  • regularly inflected forms like pouring are not distinct in the lexicon from
    their stems (pour).
Evidence of Human Lexicon
[Figure (after Marslen-Wilson et al., 1994): a fragment of the mental lexicon in
which affixes such as –ure, –al, –s, and –ing are linked to the stems depart and
govern, while department has its own entry.]
• Marslen-Wilson et al. (1994) result: derived words are linked to their stems
  only if they are semantically related.
Summary
• Morphology: the area of language processing dealing with the subparts of
  words.
• Finite-state transducer: a computational device that is important for
  morphology.
• Applications covered in this chapter:
  • stemming and morphological parsing (stem + affixes),
  • word and sentence tokenization, and
  • spelling error detection.
Summary
• Morphological parsing is the process of finding the constituent morphemes in
  a word (e.g., cat +N +PL for cats).
• English mainly uses prefixes and suffixes to express inflectional and
  derivational morphology.
• English inflectional morphology is relatively simple and includes person and
  number agreement (-s) and tense markings (-ed and -ing).
• English derivational morphology is more complex and includes suffixes like
  -ation, -ness, and -able, as well as prefixes like co- and re-.
• Many constraints on English morphotactics (allowable morpheme sequences)
  can be represented by finite automata.
• Finite-state transducers are an extension of finite-state automata that can
  generate output symbols.
• Important operations for FSTs include composition, projection, and
  intersection.
• Finite-state morphology and two-level morphology are applications of
  finite-state transducers to morphological representation and parsing.
• Spelling rules can be implemented as transducers.
• There are automatic transducer compilers that can produce a transducer for
  any simple rewrite rule.
Summary
• The lexicon and spelling rules can be combined by composing and intersecting
  various transducers.
• The Porter algorithm is a simple and efficient way to do stemming, stripping
  off affixes. It is not as accurate as a transducer model that includes a lexicon,
  but it may be preferable for applications like information retrieval, in which
  exact morphological structure is not needed.
• Word tokenization can be done by simple regular expression substitutions or
  by transducers.
• Spelling error detection is normally done by finding words that are not in a
  dictionary; an FST dictionary can be useful for this.
• The minimum edit distance between two strings is the minimum number of
  operations it takes to edit one into the other. Minimum edit distance can be
  computed by dynamic programming, which also results in an alignment of the
  two strings.
END