Maximum Entropy Modeling and its application to NLP


Maximum Entropy Modeling and
its application to NLP
Utpal Garain
Indian Statistical Institute, Kolkata
http://www.isical.ac.in/~utpal
Language Engineering in Daily Life
In Our Daily Life
• Message, Email
– We can now type our messages in our own language/script
• Oftentimes I need not write the full text
– My mobile understands what I intend to write!!
– I ha reac saf
– I have reached safely
• Even if I am afraid of typing in my own language (so
many letters, spellings are so difficult.. Uffss!!)
– I type my language in “English” and my computer or my
mobile types it in my language!!
– mera bharat..
In Our Daily Life
• I say “maa…” to my cell
– and my mother’s number is called!
• I have gone back to my previous days and left typing in the
computer/mobile
– I just write on a piece of paper or scribble on the screen
– My letters are typed!!
• Those days were so boring…
– If you are an existing customer press 1, otherwise press 2
– If you remember your customer ID press 1, otherwise press 2
– So on and so on..
– Now I just say "1", "service", "cricket" and the telephone understands what I want!!
• My grandma can't read English but she told me she found her name written in Hindi in the Railway reservation chart
– Do Railway staff type so many names in Hindi every day?
– NO!! A computer does this
In Our Daily Life
• Cross Lingual Information Search
– I wanted to know what exactly happened that created such a big inter-community problem in UP
– My friend told me to read the UP newspapers
– I don't know Hindi
– I gave a query on the net in my language
– I got news articles from UP local newspapers translated into my language!! Unbelievable!!!
• Translation
– I don’t know French
– Still I can chat with my French friend!
In Our Daily Life
• I had a problem drawing the diagram for this
– ABCD is a parallelogram, DC is extended to E such that BCE is an equilateral triangle.
– I gave it to my computer and it drew the diagram, showing the steps!!
• I got three history books for my son and couldn't decide which one would be good for him
– My computer suggested Book 2 as it has better readability for a grade-V student
– Later on, I found it was right!!!
• I type questions on the net and get answers (oftentimes they are correct!!)
– How does it happen?!!!
Language
• Language is key to culture
– Communication
– Power and Influence
– Identity
– Cultural records
• The multilingual character of Indian society
– Need to preserve this character to move successfully
towards closer cooperation at a political, economic, and
social level
• Language is both the basis for communication and
a barrier
Role of Language
Courtesy: Simpkins and Diver
Language Engineering
• Application of knowledge of language to the
development of computer systems
– That can understand, interpret and generate human
language in all its forms
• Comprises a set of
– Techniques and
– Language resources
Components of Language Engg.
• Get material
– Speech, typed/printed/handwritten text, image, video
• Recognize the language and validate it
– Encoding scheme, distinguishing separate words..
• Build an understanding of the meaning
– Depending on the application you target
• Build the application
– Speech to text
• Generate and present the results
– Use monitor, printer, plotter, speaker, telephone…
Language Resources
• Lexicons
– Repository of words and knowledge about them
• Specialist lexicons
– Proper names, Terminology
– Wordnets
• Grammars
• Corpora
– Language sample
– Text, speech
– Helps to train a machine
NLP vs. Speech
• Consider these two types of problems:
– Problem set-1
• “I teach NLP at M.Tech. CS” => what is it in Bengali?
• Scan newspaper, pick out those news dealing with
forest fires, fill up a database with relevant information
– Problem set-2
• In someone’s utterance you might have difficulty distinguishing “merry” from “very” or “pan” from “ban”
• Context often overcomes this
– Please give me the ??? (pan/ban)
– The choice you made was ??? good.
NLP
• NLU community is more concerned about
– Parsing sentences
– Assigning semantic relations to the parts of a
sentence
– etc…
• Speech recognition community
– Predicting next word on the basis of the words so
far
• Extracting the most likely words from the signal
• Deciding among these possibilities using knowledge
about the language
NLP
• NLU demands “understanding”
– Requires a lot of human effort
• Speech people rely on statistical technology
– Absence of any understanding limits its ability
• Combination of these two techniques
NLP
• Understanding
– Rule based
• POS tagging
• Tag using rule base
– I am going to make some tea
– I dislike the make of this shirt
– Use grammatical rules
– Statistical
• Use probability
• Probability of sequence/path
[Diagram: candidate tag sequences for the two example sentences, with tags such as PN, VG, V, PREP, ART, V/N?, ADJ, PP, N]
Basic Probability
Probability Theory
• X: random variable
– Uncertain outcome of some event
• V(X): the set of possible outcomes
– Example event: open to some page of an English book and
X is the word you pointed to
– V(X) ranges over all possible words of English
• If x is a possible outcome of X, i.e. x ∈ V(X)
– P(X=x) or P(x)
• If wi is the i-th word, the probability of picking the i-th word is
  P(w = wi) = |wi| / Σ_j |wj|
• If U denotes the universe of all possible outcomes then the denominator is |U|.
(A small counting sketch in Python follows.)
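To make the counting concrete, here is a minimal sketch of the relative-frequency estimate above. The toy corpus is an assumption, not something from the slides.

```python
# Estimating P(w = w_i) = |w_i| / sum_j |w_j| from raw counts in a toy corpus.
from collections import Counter

corpus = "the cat ate the fish and the dog slept".split()
counts = Counter(corpus)
total = sum(counts.values())        # |U|, the size of the universe of outcomes

def p(word):
    """Relative-frequency estimate of P(w = word)."""
    return counts[word] / total

print(p("the"))                     # 3/9 ≈ 0.333
```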
Conditional Probability
• Pick up two words which are in a row -> w1 and w2
– Or, given the first word, guess the second word
– Choice of w1 changes things
  P(w2 = wj | w1 = wi) = |w1 = wi, w2 = wj| / |w1 = wi|
• Bayes’ law: P(x|y) = P(x) * P(y|x)/P(y)
– |x,y|/|y| = (|x|/|U|) * (|y,x|/|x|) / (|y|/|U|)
• Given some evidence e, we want to pick up the best
conclusion P(c|e)… it is done
– if we know P(c|e) = P(c) * P(e|c) /P(e)
• Once evidence is fixed then the denominator stays
the same for all conclusions.
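A small sketch of the conditional-probability estimate from adjacent word pairs, matching the formula above. The eight-word toy corpus is made up for illustration.

```python
# Estimating P(w2 = wj | w1 = wi) = |w1=wi, w2=wj| / |w1=wi| from adjacent word pairs.
from collections import Counter

corpus = "the cat ate the fish the cat slept".split()
pair_counts = Counter(zip(corpus, corpus[1:]))               # counts of (w1, w2) in a row
first_counts = Counter(w1 for w1, _ in zip(corpus, corpus[1:]))

def p_cond(w2, w1):
    # Bayes' law rearranges the same counts: P(w1|w2) = P(w1) * P(w2|w1) / P(w2)
    return pair_counts[(w1, w2)] / first_counts[w1]

print(p_cond("cat", "the"))   # 2/3: "the" is followed by "cat" twice and by "fish" once
```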
Conditional Probability
• P(w,x|y,z) = P(w,x) P(y,z|w,x) / P(y,z)
• Generalization:
– P(w1,w2,…,wn) = P(w1) P(w2|w1) P(w3|w1,w2) …. P(wn|w1,w2,…,wn-1)
– P(w1,w2,…,wn|x) = P(w1|x) P(w2|w1,x) P(w3|w1,w2,x) …. P(wn|w1,w2,…,wn-1,x)
– P(W1,n = w1,n)
• Example:
– John went to ?? (hospital, pink, number, if)
Conditional Probability
• P(w1,n|speech signal)
= P(w1,n) P(signal|w1,n)/P(signal)
• Say, there are words (a1,a2,a3) (b1,b2)
(c1,c2,c3,c4)
• We want P(a2,b1,c4|signal), which requires the prior P(a2,b1,c4)
• P(a2,b1,c4|signal)
  ∝ P(a2,b1,c4) * P(signal|a2,b1,c4)
• Example:
– The {big / pig} dog
– P(the big dog) = P(the) P(big|the) P(dog|the big)
– P(the pig dog) = P(the) P(pig|the) P(dog|the pig)
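A tiny sketch of the chain-rule comparison above. The probability values are made-up placeholders, not estimates from any corpus.

```python
# Scoring "the big dog" vs "the pig dog" with the chain rule
# P(w1,w2,w3) = P(w1) P(w2|w1) P(w3|w1,w2), using hypothetical probabilities.
probs = {
    ("the",): 0.06,
    ("big", "the"): 0.01,     ("pig", "the"): 0.0001,
    ("dog", "the big"): 0.02, ("dog", "the pig"): 0.02,
}
p_big = probs[("the",)] * probs[("big", "the")] * probs[("dog", "the big")]
p_pig = probs[("the",)] * probs[("pig", "the")] * probs[("dog", "the pig")]
print(p_big > p_pig)   # True: the language model prefers "the big dog"
```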
Application Building
• Predictive Text Entry
– Tod => Today
– I => I
– => have
– a => a
– ta => take, tal => talk
– Tod => today, tod, toddler
• Techniques
– Probability of the word
– Probability of the word at position “x”
– Conditional probability
• What is the probability of writing “have” after writing two words
“today” and “I”
• Resource
– Language corpus
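A minimal sketch of the technique described on this slide: rank completions of a typed prefix by how often they follow the two previous words, falling back to overall word frequency. The tiny corpus and the helper name `complete` are assumptions for illustration.

```python
# Predictive text entry: rank candidate completions of a prefix using trigram counts,
# breaking ties by unigram frequency.
from collections import Counter

corpus = "today i have a talk today i have a task today i take a break".split()
unigrams = Counter(corpus)
trigrams = Counter(zip(corpus, corpus[1:], corpus[2:]))

def complete(prefix, w_prev2, w_prev1, k=3):
    candidates = [w for w in unigrams if w.startswith(prefix)]
    def score(w):
        # prefer continuations seen after (w_prev2, w_prev1); then overall frequency
        return (trigrams[(w_prev2, w_prev1, w)], unigrams[w])
    return sorted(candidates, key=score, reverse=True)[:k]

print(complete("ta", "have", "a"))   # ['talk', 'task', 'take']
```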
Application Building
• Transliteration
– Kamal => কমল
• Indian Railways did it before
– Rule based
• Kazi => কাজী
• Ka => ক or কা
– Difficult to extend it to other languages
• Statistical model
– N-gram modeling
– Kamal => ka am ma al; kazi => ka az zi
– কমল => কম মল; কাজী => কা াজ জী
– Alignment of pairs (a difficult computational problem)
Transliteration
• Probability is computed for
– P(ka=>ক), P(ক), P(কমল ), etc.
• Best probable word is the output
• Advantage:
– Easily extendable to any language pairs
– Multiple choices are given (according to rank)
• Resource needed
– Name pairs
– Language model
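A hedged sketch of the "best probable word" idea on this slide: score candidate outputs by the product of learned unit-mapping probabilities. The probability table and the candidate segmentations below are entirely made up; a real system would learn them from aligned name pairs.

```python
# Ranking transliteration candidates by the product of n-gram mapping probabilities.
p_map = {("ka", "কা"): 0.6, ("ka", "ক"): 0.4, ("zi", "জী"): 0.9, ("zi", "জি"): 0.1}

def score(pairs):
    prob = 1.0
    for src, tgt in pairs:
        prob *= p_map.get((src, tgt), 1e-6)   # tiny floor for unseen mappings
    return prob

candidates = [[("ka", "কা"), ("zi", "জী")], [("ka", "ক"), ("zi", "জী")]]
best = max(candidates, key=score)
print("".join(t for _, t in best), score(best))   # কাজী wins under these toy numbers
```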
Statistical Models and Methods
Statistical models and methods
• Intuition to make crude probability judgments
• Entropy
– Situation:  No occurrence | 1st occurrence | 2nd occurrence | Both
– Prob.:      0.5           | 0.125          | 0.125          | 0.25
• Average number of bits: [1*(1/2) + 2*(1/4) + 3*(1/8+1/8)] bits = 1.75 bits
• If a random variable W takes on one of several values in V(W), its entropy is
  H(W) = -Σ_{w ∈ V(W)} P(w) log P(w)
• -log P(w) bits are required to code w
Use in Speech
• {the, a, cat, dog, ate, slept, here, there}
• If use of each word is equally likely and independent of the others
• Then the entropy of the language is
  -P(the) log P(the) - P(a) log P(a) - …
  = 8 * (-(1/8) * log(1/8))
  = 3 bits/word
– In general, H(L) = lim_{n→∞} -(1/n) Σ_{w1,n} P(w1,n) log P(w1,n)
(An entropy sketch in code follows.)
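A minimal sketch of the entropy computation H(W) = -Σ_w P(w) log2 P(w), checked against the two numbers on these slides.

```python
import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.125, 0.125, 0.25]))   # 1.75 bits (the four-situation example)
print(entropy([1 / 8] * 8))                 # 3.0 bits (eight equiprobable words)
```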
Markov Chain
• If we remove the numbers then it’s a finite state automaton
which is acceptor as well as generator
• Adding the probabilities we make it probabilistic finite state
automaton => Markov Chain
• Assuming all states are accepting states (Markov Process), we
can compute the prob. of generating a given string
• Product of probabilities of the arcs traversed in generating
the string.
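A sketch of computing the probability of a string under a Markov chain as the product of the probabilities of the arcs traversed. The transition table below is a made-up stand-in for the automaton drawn on the slide.

```python
arcs = {                                     # (state, word) -> (next_state, probability)
    ("S0", "the"): ("S1", 0.5),  ("S0", "a"):   ("S1", 0.5),
    ("S1", "dog"): ("S2", 0.5),  ("S1", "cat"): ("S2", 0.5),
    ("S2", "slept"): ("S0", 0.5), ("S2", "ate"): ("S0", 0.5),
}

def string_prob(words, state="S0"):
    prob = 1.0
    for w in words:
        state, p = arcs[(state, w)]          # follow the arc labelled with this word
        prob *= p
    return prob

print(string_prob(["the", "dog", "slept"]))  # 0.5 * 0.5 * 0.5 = 0.125
```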
Cross entropy
• Per word entropy of the previous model is
– -[0.5 log(1/2) + 0.5 log(1/2)] = 1
– At each state there are only two equi-probable choices, so
• H(p) = 1
• If we consider each word equi-probable then H(pm) = 3 bits/word
• Cross Entropy
– Cross entropy of a set of random variables W1,n, where the correct model is P(w1,n) but the probabilities are estimated using the model Pm(w1,n), is
  H(W1,n, Pm) = -Σ_{w1,n} P(w1,n) log Pm(w1,n)
Cross entropy
• Per word cross entropy is (1/n) H(W1,n, Pm)
• Per word entropy of the given Markov Chain: 1
• If we slightly change the model:
– Outgoing probabilities are 0.75 and 0.25
– per word cross entropy becomes
• -[1/2 log(3/4) + 1/2 log(1/4)]
  = -(1/2)[log 3 - log 4 + log 1 - log 4]
  = -(1/2)[log 3 - 2 log 4]
  = 2 - (log 3)/2 ≈ 1.2
• For an incorrect model:
– H(W1,n) ≤ H(W1,n, Pm)
(A short numeric check follows.)
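A quick numeric check of the per-word cross entropy above, for a true next-word distribution of (0.5, 0.5) and a model that assigns (0.75, 0.25).

```python
import math

true_p, model_p = [0.5, 0.5], [0.75, 0.25]
cross = -sum(p * math.log2(q) for p, q in zip(true_p, model_p))
true_entropy = -sum(p * math.log2(p) for p in true_p)
print(round(cross, 3), true_entropy)   # ≈ 1.208 vs 1.0 bits/word: H(W) <= H(W, Pm)
```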
Cross entropy
• Cross entropy of a language
– A stochastic process is ergodic if its statistical properties can be computed from a single, sufficiently large sample of the process.
– Assuming L is an ergodic language
– Cross entropy of L is
  H(L, Pm) = -lim_{n→∞} (1/n) Σ_{w1,n} P(w1,n) log Pm(w1,n)
           = -lim_{n→∞} (1/n) log Pm(w1,n)
Corpus
• Brown corpus
– Coverage
• 500 text segments of 2,000 words each
• Press, reportage etc.: 44
• Press, editorial etc.: 27
• Press, reviews: 17
• Religion books, periodicals..: 17
• …
Trigram Model
Trigram models
• N-gram model…
• P(wn|w1…wn-1) ≈ P(wn|wn-2wn-1)
• P(w1,n) = P(w1)P(w2|w1)P(w3|w1w2).. P(wn|w1,n-1)
  ≈ P(w1)P(w2|w1)P(w3|w1w2).. P(wn|wn-2wn-1)
• i.e., P(w1,n) ≈ P(w1)P(w2|w1) Π_{i=3..n} P(wi|wi-2wi-1)
• Pseudo words: w-1, w0
• “to create such”
• #to create such=?
• #to create=?
  P(w1,n) = Π_{i=1..n} P(wi | wi-2, wi-1)
  Pe(wi | wi-2, wi-1) = C(wi-2, wi-1, wi) / C(wi-2, wi-1)
(A counting sketch for this estimate follows.)
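A minimal sketch of the maximum-likelihood trigram estimate above, using the "to create such" counts on a made-up sentence.

```python
# Pe(wi | wi-2, wi-1) = C(wi-2, wi-1, wi) / C(wi-2, wi-1)
from collections import Counter

corpus = "we want to create such a model to create such systems".split()
trigrams = Counter(zip(corpus, corpus[1:], corpus[2:]))
bigrams  = Counter(zip(corpus, corpus[1:]))

def p_trigram(w, w2, w1):
    """Pe(w | w2 w1) = #(w2 w1 w) / #(w2 w1)."""
    return trigrams[(w2, w1, w)] / bigrams[(w2, w1)]

print(p_trigram("such", "to", "create"))   # #("to create such") / #("to create") = 2/2 = 1.0
```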
Trigram as Markov Chain
• It is not possible to determine state of the
machine simply on the basis of the last output
(the last two outputs are needed)
• Markov chain of order 2
Problem of sparse data
• Jelinek studied
– 1,500,000 word corpus
– Extracted trigrams
– Applied to 300,000 words
– 25% trigram types were missing
Maximum Entropy Model
An example
• Machine translation
– Star in English
• Translation in Hindi:
– सितारा, तारा, तारक, प्रसिद्ध अभिनेता, भाग्य
• First statistics of this process
– p(सितारा)+p(तारा)+p(तारक)+p(प्रसिद्ध अभिनेता)+p(भाग्य) = 1
• There are an infinite number of models p for which this identity holds
• One model
– p(सितारा) = 1
– This model always predicts सितारा
• Another model
– p(तारा) = ½
– p(प्रसिद्ध अभिनेता) = ½
• These models offend our sensibilities
– The expert always chose from the five choices
– How can we justify either of these probability
distribution ?
– These models make bold assumptions without empirical justification
• What we know
– Experts chose exclusively from these five words
• the most intuitively appealing model is
– p(सितारा) = 1/5
– p(तारा) = 1/5
– p(तारक) = 1/5
– p(प्रसिद्ध अभिनेता) = 1/5
– p(भाग्य) = 1/5
• The most uniform model subject to our knowledge
• Suppose we notice that the expert chose either सितारा or तारा 30% of the time
• We apply this knowledge to update our model
– p(सितारा)+p(तारा) = 3/10
– p(सितारा)+p(तारा)+p(तारक)+p(प्रसिद्ध अभिनेता)+p(भाग्य) = 1
• Many probability distributions consistent with
the above constraints
• A reasonable choice for p is again the most
uniform
• i.e. the distribution which allocates its
probability as evenly as possible, subject to
the constraints
– p(सितारा) = 3/20
– p(तारा) = 3/20
– p(तारक) = 7/30
– p(प्रसिद्ध अभिनेता) = 7/30
– p(भाग्य) = 7/30
• Say we inspect the data once more and notice
another interesting fact
– In half the cases, the expert chose either सितारा or प्रसिद्ध अभिनेता
• So we add a third constraint
– p(सितारा)+p(तारा) = 3/10
– p(सितारा)+p(तारा)+p(तारक)+p(प्रसिद्ध अभिनेता)+p(भाग्य) = 1
– p(सितारा)+p(प्रसिद्ध अभिनेता) = ½
• Now if we want to look for the most uniform p
satisfying the constraints the choice is not as
obvious
• As complexity is added, we face two difficulties
– What is meant by “uniform” and how can we measure the uniformity of a model?
– How will we find the most uniform model subject to a set of constraints?
• Maximum entropy method (E. T. Jaynes) answers
both of these questions
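Before turning to the formal treatment, here is a hedged numerical sketch of "the most uniform model subject to the constraints": it maximizes entropy over the five translation probabilities subject to the three constraints listed above. The use of scipy and the ordering of the labels are assumptions; this is not the iterative-scaling procedure the lecture develops later.

```python
import numpy as np
from scipy.optimize import minimize

labels = ["सितारा", "तारा", "तारक", "प्रसिद्ध अभिनेता", "भाग्य"]

def neg_entropy(p):
    p = np.clip(p, 1e-12, 1)              # avoid log(0)
    return np.sum(p * np.log(p))          # minimizing this maximizes entropy

constraints = [
    {"type": "eq", "fun": lambda p: p.sum() - 1.0},       # probabilities sum to 1
    {"type": "eq", "fun": lambda p: p[0] + p[1] - 0.3},   # p(सितारा) + p(तारा) = 3/10
    {"type": "eq", "fun": lambda p: p[0] + p[3] - 0.5},   # p(सितारा) + p(प्रसिद्ध अभिनेता) = 1/2
]
res = minimize(neg_entropy, x0=np.full(5, 0.2),
               bounds=[(0, 1)] * 5, constraints=constraints)
print(dict(zip(labels, res.x.round(3))))   # the "not as obvious" maximum-entropy solution
```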
Maximum Entropy Modeling
• Consider a random process that produces an output value
y (a member of a finite set, Y)
• For the translation example just considered, the process
generates a translation of the word star, and the output y
can be any word in the set {सितारा, तारा, तारक, प्रसिद्ध अभिनेता, भाग्य}.
• In generating y, the process may be influenced by some
contextual information x, a member of a finite set X.
• In the present example, this information could include the
words in the English sentence surrounding star.
• Our task is to construct a stochastic model that accurately
represents the behavior of the random process.
Maximum Entropy Modeling
• Such a model is a method of estimating the conditional probability that, given a context x, the process will output c (writing c for the output value y above).
• We will denote by p(c|x) the probability that the model assigns to c in context x.
• We will denote by P the set of all conditional probability distributions. Thus a model p(c|x) is, by definition, just an element of P.
Training Data
• A large number of samples
– (x1,c1), (x2, c2) . . . (xN, cN).
• Each sample would consist of a phrase x
containing the words surrounding star, together
with the translation c of star that the process
produced.
• Empirical probability distribution p̃:
  p̃(x, c) = (1/N) × (number of times that (x, c) occurs in the sample)
Features
• The features fi are binary functions that can be used to characterize any property of a pair (ẋ, c)
• ẋ is a vector representing an input element and c is the class label
• Example: f(ẋ, c) = 1 if c = प्रसिद्ध अभिनेता and star follows cinema; otherwise f(ẋ, c) = 0
Features
• We have two things in hand
– Empirical distribution
– The model p(c|x)
• The expected value of f with respect to the empirical distribution is
  p̃(f) = Σ_{x,c} p̃(x, c) f(x, c)
• The expected value of f with respect to the model p(c|x) is
  p(f) = Σ_{x,c} p̃(x) p(c|x) f(x, c)
• Our constraint is
  p(f) = p̃(f)
Classification
• For a given ẋ we need to know its class label c
– p(ẋ, c)
• Loglinear models
– General and very important class of models for
classification of categorical variables
– Logistic regression is another example
– The model has the form
  p(ẋ, c) = (1/Z) Π_{i=1..K} α_i^{fi(ẋ, c)}
  log p(ẋ, c) = -log Z + Σ_{i=1..K} fi(ẋ, c) log α_i
– K is the number of features, α_i is the weight for the feature fi and Z is a normalizing constant used to ensure that a probability distribution results.
An example
• Text classification
• ẋ consists of a single element, indicating presence
or absence of the word profit in the article
• Classes, c
– two classes; earnings or not
• Features
– Two features
• f1: 1 if and only if the article is “earnings” and the word profit is in it
• f2: a filler feature (fK+1), defined as
  f_{K+1}(ẋ, c) = C - Σ_{i=1..K} fi(ẋ, c)
– C is the greatest possible feature sum
An example

x (profit present?) | c (“earnings”?) | f1 | f2 | f1·log α1 + f2·log α2 | 2^(f1·log α1 + f2·log α2)
0                   | 0               | 0  | 1  | 1                     | 2
0                   | 1               | 0  | 1  | 1                     | 2
1                   | 0               | 0  | 1  | 1                     | 2
1                   | 1               | 1  | 0  | 2                     | 4

• Parameters:
– log α1 = 2.0, log α2 = 1.0
• The unnormalized weight of each pair is Π_i α_i^{fi(ẋ, c)} = 2^{Σ_i fi(ẋ, c) log α_i}
• Z = 2+2+2+4 = 10
• p(0,0) = p(0,1) = p(1,0) = 2/10 = 0.2
• p(1,1) = 4/10 = 0.4
• A data set that follows the same empirical distribution:
– ((0,0), (0,1), (1,0), (1,1), (1,1))
(A code sketch reproducing this table follows.)
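A sketch reproducing the toy loglinear computation above, with the parameter values from the slide (log2 α1 = 2, log2 α2 = 1) and the same feature definitions.

```python
# p(x, c) = (1/Z) * prod_i alpha_i ** f_i(x, c)
from itertools import product

log_alpha = [2.0, 1.0]                      # log2 weights for f1 and the filler f2

def features(x, c):
    f1 = 1 if (x == 1 and c == 1) else 0    # "profit" present AND class is "earnings"
    f2 = 1 - f1                             # filler feature keeps the feature sum constant
    return [f1, f2]

weights = {(x, c): 2 ** sum(f * la for f, la in zip(features(x, c), log_alpha))
           for x, c in product([0, 1], repeat=2)}
Z = sum(weights.values())                   # 10
p = {xc: w / Z for xc, w in weights.items()}
print(Z, p)                                 # p(1,1) = 0.4, the other three pairs 0.2 each
```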
Computation of α_i and Z
• We search for a model p* such that
  E_p* fi = E_p̃ fi
• Empirical expectation
  E_p̃ fi = Σ_{x,c} p̃(ẋ, c) fi(ẋ, c) = (1/N) Σ_{j=1..N} fi(ẋj, cj)
• In general Epfi cannot be computed efficiently
as it would require summing over all possible
combinations of ẋ and c, a huge or infinite set.
• The following approximation is used instead (a code sketch follows):
  E_p fi ≈ Σ_{x,c} p̃(ẋ) p(c|ẋ) fi(ẋ, c) = (1/N) Σ_{j=1..N} Σ_c p(c|ẋj) fi(ẋj, c)
Generalized Iterative Scaling Algo
• Step 1
– For all i =1, K+1, initialize i(1).
– Compute empirical expectation
– Set n =1
• Step 2
– Compute pn(x,c) for the distribution pn given by
the {j(n)} for each element (x,c) in the training
set
K 1

1
p(n)(x , c)   ( i( n ) )
Z i 1

fi ( x ,c )
K 1
where Z   ( i( n ) )

x,c i 1

fi ( x ,c )
Generalized Iterative Scaling Algo
• Step 3
– Compute E_{p^(n)} fi for all i = 1, …, K+1 according to the formula shown before
• Step 4
– Update the parameters α_i:
  α_i^(n+1) = α_i^(n) · (E_p̃ fi / E_{p^(n)} fi)^(1/C)
• Step 5
– If the parameters have converged, stop; otherwise increment n and go to Step 2.
(A small end-to-end sketch of these steps follows.)
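A hedged end-to-end sketch of Generalized Iterative Scaling on the toy text-classification example above. The feature definitions and the five training samples come from the slides; the variable names, the fixed iteration count, and the use of the sample-based expectation approximation are assumptions of this sketch.

```python
train = [(0, 0), (0, 1), (1, 0), (1, 1), (1, 1)]        # (x, c) samples from the slide
classes = [0, 1]
C = 1.0                                                 # greatest possible sum of f1

def features(x, c):
    f1 = 1.0 if (x == 1 and c == 1) else 0.0
    return [f1, C - f1]                                  # f2 is the filler feature

K1 = 2                                                   # K+1 features in total
alpha = [1.0] * K1                                       # Step 1: initialise alpha_i
N = len(train)
emp = [sum(features(x, c)[i] for x, c in train) / N for i in range(K1)]

for n in range(200):                                     # Steps 2-5, iterated
    # Step 2: unnormalised weights, joint model p(x, c) and conditional p(c | x)
    w = {(x, c): alpha[0] ** features(x, c)[0] * alpha[1] ** features(x, c)[1]
         for x in (0, 1) for c in classes}
    Z = sum(w.values())
    p_joint = {xc: v / Z for xc, v in w.items()}
    p_cond = {(x, c): p_joint[(x, c)] / sum(p_joint[(x, c2)] for c2 in classes)
              for x in (0, 1) for c in classes}
    # Step 3: model expectations, approximated over the training sample
    mod = [sum(p_cond[(x, c)] * features(x, c)[i] for x, _ in train for c in classes) / N
           for i in range(K1)]
    # Step 4: multiplicative update with exponent 1/C
    alpha = [a * (e / m) ** (1 / C) for a, e, m in zip(alpha, emp, mod)]

print([round(a, 3) for a in alpha])
print(round(p_cond[(1, 1)], 3))      # ≈ 0.667: P(c = "earnings" | profit present)
```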
Application of MaxEnt in NLP
• POS tagger
– Stanford tagger
• At our lab
– Honorific information
– Use of this information for Anaphora Resolution
– BioNLP
• Entity tagger
• Stanford Univ. has open source code for MaxEnt
• You can also use their implementation for your
own task.
HMM, MaxEnt and CRF
• HMM
– Observation and class
• MaxEnt
– Local decision
• CRF
– Combines the strengths of HMM and MaxEnt