Transcript Lecture 5
Experiences with Indian Language Morphology
Monojit Choudhury
RS, CSE, IIT Kharagpur
28/07/2005
Speech and NLP
When do we need MA/MS (morphological analysis/synthesis)?
The alternative: store all word forms
Advantages:
Less effort for NLP
Less time for processing
Disadvantages:
More words mean more space and more search time
How to tackle unseen words?
Therefore, we need MA/MS when
The language is morphologically rich
large number of affixes
concatenation of affixes/compounding
Example: Turkish, German, Sanskrit …
The language is morphologically productive
Speakers/writers can coin new words by following morphological rules
Example: German, Sanskrit …
A Problem to ponder
How do we decide whether a language is morphologically rich and/or productive?
Linguistically
Difficult (enumerate all morphological processes)
Fuzzy/Subjective
Can you suggest some formal technique?
Hint: Statistics
Vocabulary Growth
[Figure: vocabulary size V(N) plotted against corpus size N; the axes run up to N = 3,500,000 and V(N) = 200,000. Labelled points: BENGALI (3019565, 182848), HINDI (2967438, 121603).]
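The vocabulary growth curve can be computed directly from a tokenized corpus: V(N) is the number of distinct word types seen in the first N tokens. A minimal sketch, assuming the corpus is already split into tokens; the function name `vocabulary_growth` and the toy sentence are illustrative and not from the lecture.

```python
def vocabulary_growth(tokens, step=1):
    """Return (N, V(N)) pairs: distinct word types seen after N tokens."""
    seen = set()
    curve = []
    for n, token in enumerate(tokens, start=1):
        seen.add(token)
        # Sample the curve every `step` tokens and at the very end.
        if n % step == 0 or n == len(tokens):
            curve.append((n, len(seen)))
    return curve

# Toy corpus (hypothetical); a real experiment would use millions of tokens.
toy = "the boy sees the boys and the boy saw the girl".split()
growth = vocabulary_growth(toy, step=5)
print(growth)  # [(5, 4), (10, 6), (11, 7)]
```

At a comparable corpus size, a curve that keeps climbing faster (Bengali against Hindi in the figure) suggests richer or more productive morphology.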
Another Estimate
How many different forms of a verb are there in
English – 5
Hindi – ~20 (without causation)
Bengali – ~170 (without causation)
Telugu – ~1000
Sanskrit – ~51480 (with derivational affixes), ~3960 (otherwise)
Three basic concerns
While designing a morphological analyzer/generator one must consider
Productivity of a rule
Morphological paradigms
Irregular morphology
Productivity of a Rule
Rule         Example              Productivity
VR + tA      jAtA, letA           *****
NR + ika     dainika, sAmAjika    **
Adj + imA    lAlimA, niilimA      X
Productive Rules for Bengali/Hindi
Inflectional Morphology
Verb
Noun
Adjectives
Pronouns
Emphasizing in Bengali: i and o
Derivational Morphology
Compounding
Prefixation
Suffixation
Morphological paradigms
Classes of words that inflect similarly
Hindi noun roots take 4 inflections
Singular, direct     laDakA, laDakii
Plural, direct       laDake, laDakiyA.N
Singular, oblique    laDake, laDakii
Plural, oblique      laDako.N, laDakiyo.N
How many paradigms for nouns?
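A paradigm can be stored as a table of ending replacements, one per (number, case) slot. A minimal sketch covering only the two noun classes on the slide; the paradigm names `masc_A` and `fem_ii` and the function `inflect` are made up for illustration, and the oblique plural of laDakA is written laDako.N, with the nasal marker.

```python
# Each paradigm: the citation-form ending, and the ending to substitute
# for each (number, case) slot. Paradigm names are hypothetical.
PARADIGMS = {
    "masc_A": ("A", {
        ("sg", "direct"): "A", ("pl", "direct"): "e",
        ("sg", "oblique"): "e", ("pl", "oblique"): "o.N",
    }),
    "fem_ii": ("ii", {
        ("sg", "direct"): "ii", ("pl", "direct"): "iyA.N",
        ("sg", "oblique"): "ii", ("pl", "oblique"): "iyo.N",
    }),
}

def inflect(noun, paradigm, number, case):
    """Replace the citation ending with the ending for the requested slot."""
    ending, slots = PARADIGMS[paradigm]
    if not noun.endswith(ending):
        raise ValueError(f"{noun!r} does not fit paradigm {paradigm!r}")
    stem = noun[: len(noun) - len(ending)]
    return stem + slots[(number, case)]

print(inflect("laDakA", "masc_A", "pl", "oblique"))   # laDako.N
print(inflect("nadii", "fem_ii", "pl", "direct"))     # nadiyA.N
```

Any noun assigned to the same paradigm reuses the same four endings, which is what makes the paradigm a useful unit for the lexicon.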
How to identify the paradigms?
Paradigms may be based on
Syllable structure (e.g. laDakii, nadii, sakhii)
Gender (e.g. dhobii vs. nadii)
Semantics (e.g. lohA vs. dohA)
Which of these distinctions can be identified automatically? How?
Paradigms for Bengali Nouns
Bengali noun inflections:
Classifier suffixes: TA, gulo, rA, etc.
Case markers: er, ke, der, te, etc.
Emphasizers: i, o
Paradigms are based on semantics
Inanimate objects take TA, gulo
Animate objects take rA, dera
Irregular Morphology
All languages feature irregular morphology
English: ox – oxen, go – went
Hindi: jAnA – gayA, karanA – kiyA
Bengali: yAoYA – gela, AsA – ela
Better to list them as exceptions and treat them separately
Bengali has only 4 exceptional verbs, Hindi has 2
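Treating irregular forms separately usually means consulting an exception table before any regular rule fires. A minimal sketch using the English pairs from the slide; the table layout, the feature labels and the deliberately naive fallback rules are assumptions for illustration only.

```python
# Exception table checked first; the regular rule is the fallback.
IRREGULAR = {
    ("ox", "plural"): "oxen",
    ("go", "past"): "went",
}

def regular(word, feature):
    # Deliberately naive English rules, just to have a fallback.
    if feature == "plural":
        return word + "s"
    if feature == "past":
        return word + "ed"
    raise ValueError(f"unknown feature: {feature}")

def generate(word, feature):
    key = (word, feature)
    if key in IRREGULAR:
        return IRREGULAR[key]
    return regular(word, feature)

print(generate("go", "past"))    # went
print(generate("walk", "past"))  # walked
```

With only a handful of exceptional verbs in Bengali and Hindi, such a table stays tiny and the regular machinery never needs to know about them.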
So, we decided to
Build MS/MA for Hindi & Bengali
Cover only inflectional morphology
Cover only verbs, nouns and adjectives
We also identified
the morphological paradigms
Irregular verbs/nouns
Now we need to decide
The list of possible affixes
Their attributes
Morphotactics
And then design/build
The Input/output specification
The lexicon structure
The FST structure
Lexicon and FST search strategy
A Case Study: Bengali Verb Morphology
The information coded by affixes:
Finite forms
Tense: Past, present, future
Aspect: simple, continuous, perfect, habitual
Modality: Order, request
Person: 1st, 2nd normal (tumi), 2nd familiar (tui), 3rd (se), Honorific 2nd and 3rd (Apani, tini)
Polarity: positive/negative
Non-finite forms: e, te
Morphotactics
Root           Aspect            Tense          Person         +/-       Gloss
kar (to do)    eChi (perfect)    l (past)       Ama (1st)      Φ (+)     I had done
kar            Ch (cont.)        Φ (present)    i (1st)        Φ (+)     I’m doing
kar            Φ (simple)        b (future)     i (2nd fam)    Φ (+)     You’ll do
kar            Φ (perfect)       Φ (pre/pst)    i (1st)        ni (-)    I haven’t done / I’d not done
Morphotactics
Root + aspect + tense + person + emphasizer + polarity
Root + modality + person + emphasizer
Root + aspect1 + emphasizer + aspect2 + person + polarity
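The first pattern can be read as an ordered list of slots that are filled and concatenated. A minimal sketch, assuming a null morpheme (Φ) is the empty string and ignoring orthographic changes; the names `FINITE_ORDER` and `assemble` are illustrative.

```python
# Slot order for the pattern
# Root + aspect + tense + person + emphasizer + polarity.
FINITE_ORDER = ["root", "aspect", "tense", "person", "emphasizer", "polarity"]

def assemble(morphemes, order=FINITE_ORDER):
    """Concatenate morphemes in slot order; missing slots are null (Φ)."""
    unknown = set(morphemes) - set(order)
    if unknown:
        raise ValueError(f"slots not allowed by this pattern: {sorted(unknown)}")
    return "".join(morphemes.get(slot, "") for slot in order)

had_done = assemble({"root": "kar", "aspect": "eChi", "tense": "l", "person": "Ama"})
am_doing = assemble({"root": "kar", "aspect": "Ch", "person": "i"})
not_done = assemble({"root": "kar", "person": "i", "polarity": "ni"})
print(had_done, am_doing, not_done)  # kareChilAma karChi karini
```

The other two patterns would need their own slot lists; only the ordering changes, not the assembly step.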
Verb Suffix Table
TAM / Person        1st           2nd, familiar   2nd, normal   2nd & 3rd formal   3rd
Ind, Pr, Simple     i             isa’                          ena’               e
Ind, Pr, Cont       chhi          chhisa’         chha          chhena’            chhe
Ind, Pr, Perfect    echhi         echhisa’        echha         echhena’           echhe
Ind, Pa, Simple     lAma’         li              le            lena’              la
Ind, Pa, Cont.      chhilAma’     chhili          chhile        chhilena’          chhila
Ind, Pa, Perfect    echhilAma’    echhili         echhile       echhilena’         echhila’
Ind, Future         ba            bi              be            bena’              be
Habitual Past       tAma’         tisa’           te            tena’              ta
Imperative          -             .h/                           una’               uka’
Neg, Perfect        ini           isa’ni          ani           ena’ni             eni
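Suffix selection is a plain table lookup keyed by TAM and person. A minimal sketch holding three rows of the table; the person labels and the names `ROWS`, `SUFFIX` and `lookup_suffix` are assumptions made for this example.

```python
PERSONS = ["1st", "2nd_familiar", "2nd_normal", "formal", "3rd"]

# Three rows copied from the suffix table; the rest would be added the same way.
ROWS = {
    "Ind, Pr, Cont":   ["chhi", "chhisa’", "chha", "chhena’", "chhe"],
    "Ind, Pa, Simple": ["lAma’", "li", "le", "lena’", "la"],
    "Ind, Future":     ["ba", "bi", "be", "bena’", "be"],
}

# Flatten to a dictionary keyed by (TAM, person) for constant-time lookup.
SUFFIX = {
    (tam, person): suffix
    for tam, row in ROWS.items()
    for person, suffix in zip(PERSONS, row)
}

def lookup_suffix(tam, person):
    return SUFFIX[(tam, person)]

print(lookup_suffix("Ind, Future", "2nd_familiar"))  # bi
```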
Orthographic Changes
kar + eChilAm → kareChilAm
khA + eChilAm → kheYeChilAm
hAr + eChilAm → hereChilAm
karA + eChilAm → kariYeChilAm
tolA + eChilAm → tuliYeChilAm
khAoYA + eChilAm → khAiYeChilAm
de + eChilAm → diYeChilAm
Orthographic Classes (Paradigms?)
$      V                            a’                           A                                  oYA
a      ha [haoYA] (to happen)       kara’ [karA] (to do)         karA [karAno] (do, causative)      saoYA [saoYAno] (undergo, causative)
A      khA [khAoYA] (to eat)        jAna’ [jAnA] (to know)       jAnA [jAnAno] (to inform)          khAoYA [khAoYAno] (to feed)
i      di [deoYA] (to give)         likha’ [lekhA] (to write)    ni~NrA [ni~NrAno]                  --
e      --                           dekha’ [dekhA] (to see)      dekhA [dekhAno] (to show)          deoYA [deoYAno] (give, causative)
o      so [so;oYA] (to lie down)    tola’ [tolA] (to pick)       tolA [tolAno] (pick, causative)    so;oYA [so;oYAno] (lie, causative)
u/au   --                           --                           ghumA [ghumAno] (to sleep)         --
FSM for Recognizing Bengali Verb Class
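The FSM diagram itself is not reproduced in this transcript. As a rough stand-in, the sketch below assigns a root to a (row, column) cell of the orthographic class table using the first vowel and the ending. It is a plain function, not the FSM from the slide, and it assumes the transliteration used in the table; the name `root_class` is hypothetical.

```python
VOWELS = "aAieou"

def root_class(root):
    """Return (first vowel, ending type) for a transliterated verb root."""
    # Locate the first vowel; "au" counts as one vowel, filed with "u".
    for i, ch in enumerate(root):
        if ch in VOWELS:
            break
    else:
        raise ValueError(f"no vowel in {root!r}")
    if root[i:i + 2] == "au":
        vowel, rest = "u/au", root[i + 2:]
    else:
        vowel, rest = ("u/au" if ch == "u" else ch), root[i + 1:]

    if rest == "":
        ending = "V"      # root ends in its only vowel: ha, khA, di, so
    elif rest.endswith("oYA"):
        ending = "oYA"    # saoYA, khAoYA
    elif rest.endswith("A"):
        ending = "A"      # karA, jAnA
    else:
        ending = "a’"     # consonant-final: kara’, jAna’
    return vowel, ending

print(root_class("khA"))     # ('A', 'V')
print(root_class("so;oYA"))  # ('o', 'oYA')
```

Like an FSM, the function reads the root once from left to right, so its cost is linear in the root length.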
A Morphological Generator: Abstract Level
[Figure: block diagram of the Morphological Generator. Inputs: Root, TAM, Person, Polarity, Emph. Blocks: Suffix Table (produces the Suffix), Orthographic FST. Output: Surface Form.]
A Morphological Generator: Implementation
[Figure: block diagram of the Morphological Generator. Inputs: Root, TAM, Person, Polarity, Emph. Blocks: Irregular Root Handler, Root Class Recognizer, Suffix Table, Orthographic Rules for each Root class, Emph Adder. Output: Surface Form.]
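The blocks in the diagram can be wired together roughly as follows. This is only a sketch under strong assumptions: every table is a tiny stand-in, the root classes are looked up rather than recognized, the suffix spelling follows the Orthographic Changes slide (eChilAm), and the two rule functions are one reading of the examples kar + eChilAm → kareChilAm and khA + eChilAm → kheYeChilAm, not the lecture's actual rule tables.

```python
# Stand-in tables: one suffix and two roots only.
SUFFIX_TABLE = {("Ind, Pa, Perfect", "1st", "+"): "eChilAm"}
IRREGULAR_ROOTS = {}   # (root, TAM) -> replacement stem; empty in this sketch
ROOT_CLASS = {"kar": ("a", "a’"), "khA": ("A", "V")}  # cells of the class table

# One orthographic rule per (root class, suffix) pair.
ORTHO_RULES = {
    # Consonant-final root: plain concatenation.
    (("a", "a’"), "eChilAm"): lambda root, suffix: root + suffix,
    # khA-type root: final A becomes e and a glide Y is inserted.
    (("A", "V"), "eChilAm"): lambda root, suffix: root[:-1] + "eY" + suffix,
}

def generate(root, tam, person, polarity="+", emph=""):
    root = IRREGULAR_ROOTS.get((root, tam), root)   # irregular root handler
    suffix = SUFFIX_TABLE[(tam, person, polarity)]  # suffix table lookup
    rule = ORTHO_RULES[(ROOT_CLASS[root], suffix)]  # rule for this root class
    return rule(root, suffix) + emph                # emphasizer appended last here

print(generate("kar", "Ind, Pa, Perfect", "1st"))  # kareChilAm
print(generate("khA", "Ind, Pa, Perfect", "1st"))  # kheYeChilAm
```

Keying the rules by (root class, suffix) mirrors the idea of one orthographic rule table per root class; where exactly the emphasizer attaches is an assumption of this sketch.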
Implementation: More Facts
Memory Requirement
Root Class Recognizer: FSM with 26 states
Suffix Table: 56 suffixes (emphasizers not incl.)
Orthographic Rule Tables: 19×56 = 1064 rules
Time Requirement
Root Class Recognizer: scans the root once (r)
Suffix Selection: just table look up (constant)
Orthographic Rules: scans root + suffix once (r+s)
Emphasizer Adder: Constant time
Total time: O(r+s)
A Morphological Analyzer:
Abstract Level
Trie: a data structure also called a prefix tree (from Information Retrieval); here it is built over suffixes read right to left
Basic Notions:
Note that Bengali verb morphology only has suffixes
Scan a given word from right to left (backward)
If the substring seen is a valid suffix, see if the remaining part of the input is a valid stem/root
Take care of orthographic changes
We shall see that a trie is just another way to implement an FST with some nice properties
Trie: Construction
Make a list of all valid suffixes
NULL, i, Chi, li, eChi, YeChi, lAma, elAma
Construct the trie recursively by inserting each of the suffixes (right to left)
Every state where a suffix ends is marked as a final state
Every final state consists of
TAM, Person, Polarity information
Rewrite rules for generation of the root
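The construction steps can be sketched as below, using the suffix list from the slide with NULL written as the empty string. The `Node` class and the placeholder payload are assumptions; a real final state would carry the TAM, person and polarity attributes and the rewrite rule.

```python
class Node:
    def __init__(self):
        self.children = {}   # next character -> Node
        self.final = False   # True if some suffix ends at this state
        self.attrs = None    # attributes and rewrite rule would be stored here

def build_trie(suffixes):
    root = Node()
    for suffix in suffixes:
        node = root
        for ch in reversed(suffix):          # insert right to left
            node = node.children.setdefault(ch, Node())
        node.final = True
        node.attrs = {"suffix": suffix}      # placeholder payload
    return root

def count_states(node):
    return 1 + sum(count_states(child) for child in node.children.values())

SUFFIXES = ["", "i", "Chi", "li", "eChi", "YeChi", "lAma", "elAma"]  # "" is NULL
trie = build_trie(SUFFIXES)
print(count_states(trie))  # 12
```

Suffixes that share an ending, such as i, Chi, eChi and YeChi, share a path from the start state, which is where the space saving comes from.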
Trie: Search
Reverse the input word
Traverse the trie starting from the root (start state)
At every final state apply the orthographic rule to the rest of the string
Let r be the string obtained. Search for r in the root lexicon
If found, output the attributes
Continue the search
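The search procedure can be sketched as follows. The suffix entries, attribute labels, rewrite functions and the two-root lexicon are stand-ins chosen for illustration; in particular, the rewrite for YeChi (turning khe back into khA) is inferred from the khA example on the Orthographic Changes slide, not taken from the lecture's rule set.

```python
END = None  # key under which a final state's payload is stored

def build_trie(entries):
    """entries: suffix -> (attributes, rewrite function for the remainder)."""
    trie = {}
    for suffix, payload in entries.items():
        node = trie
        for ch in reversed(suffix):
            node = node.setdefault(ch, {})
        node[END] = payload
    return trie

def analyze(word, trie, lexicon):
    results = []
    node = trie
    chars = word[::-1]                        # reverse the input word
    for depth in range(len(chars) + 1):
        if END in node:                       # final state reached
            attrs, rewrite = node[END]
            root = rewrite(word[: len(word) - depth])
            if root in lexicon:
                results.append((root, attrs))
        if depth == len(chars) or chars[depth] not in node:
            break
        node = node[chars[depth]]             # continue the search
    return results

same = lambda stem: stem
ENTRIES = {
    "": ("bare root", same),
    "i": ("Pr Simple, 1st", same),
    "Chi": ("Pr Cont, 1st", same),
    "eChi": ("Pr Perfect, 1st", same),
    # After a vowel-final root such as khA the stem surfaces as khe + YeChi.
    "YeChi": ("Pr Perfect, 1st", lambda stem: stem[:-1] + "A"),
}
LEXICON = {"kar", "khA"}
TRIE = build_trie(ENTRIES)

print(analyze("kareChi", TRIE, LEXICON))   # [('kar', 'Pr Perfect, 1st')]
print(analyze("kheYeChi", TRIE, LEXICON))  # [('khA', 'Pr Perfect, 1st')]
```

The loop does not stop at the first final state, so a word with several valid segmentations yields several analyses.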
Trie: Computational Issues
Time Complexity
Searching the trie is linear in the input length
Searching the lexicon can also be linear
Space Complexity
In general, linear in the number of affixes
Can be reduced further by constructing a DAWG (directed acyclic word graph)
Trie vs DAWG
Trie                                   DAWG
More space                             Less space
Linear search                          Linear search
Easy to construct                      Exponential construction
Easy to insert & delete                Difficult to delete and insert
Final states have unique attributes    A final state can have ambiguous attributes
Morphological Analyzer:
Implementation Details
Size of Trie: 300 states
Size of root lexicon: 600 verb roots
Paradigm Information: Not required
Nouns, verbs and adjectives are analyzed separately
Tries can be merged but with no significant gain
Root lexicons are also distinct
Rule compilation
Summarizing
Decide whether to go for MA/MS
Identify the productive morphological processes and corresponding irregularities
Identify the paradigms and morphological attributes
Specify the morphotactics and affix list
Gather a machine-readable root lexicon
Choose an appropriate computational technique
Design, implement and test
A good interface for rule-editing is desirable