Transcript: Slide 1

Classical Probabilistic Models and
Conditional Random Fields
Roman Klinger¹, Katrin Tomanek²
Algorithm Engineering Report (TR07-2-013), 2007
1. Dortmund University of Technology Department of Computer Science
2. Language & Information Engineering (JULIE) Lab
Presented by Jian-Shiun Tzeng 4/24/2009
Abstract
• Introduction
• Naïve Bayes
• HMM
• ME
• CRF
2
Introduction
• Classification is known as the assignment of a
class y ∈ Y to an observation x ∈ X
• A well-known example (Russell and Norvig, 2003)
– Classification of weather
– Y = {good, bad}
– X = {Monday, Tuesday,…}
• x can be described by a set of features
  – e.g. $f_{cloudy}$, $f_{sunny}$, or $f_{rainy}$

  $f_{cloudy}(x) = \begin{cases} 1, & \text{if it is cloudy on day } x \\ 0, & \text{otherwise} \end{cases}$

• In general, features do not have to be binary
3
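As a concrete illustration, here is a minimal Python sketch of such a binary feature function. The day-to-weather mapping is invented example data, not part of the original slides.

```python
# Minimal sketch of a binary feature function (hypothetical data for illustration).
# It maps an observation x (a day) to 1 if that day is cloudy, else 0.

weather_on_day = {"Monday": "cloudy", "Tuesday": "sunny"}  # invented example data

def f_cloudy(x):
    """Return 1 if it is cloudy on day x, 0 otherwise."""
    return 1 if weather_on_day.get(x) == "cloudy" else 0

print(f_cloudy("Monday"))   # 1
print(f_cloudy("Tuesday"))  # 0
```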
• Modeling all dependencies in a probability
distribution is typically very complex due to
interdependencies between features
• The Naïve Bayes assumption of all features being
conditionally independent is an approach to
address this problem
• In nearly all probabilistic models such
independence assumptions are made for some
variables to make necessary computations
manageable
4
• In the structured learning scenario, multiple and typically interdependent class and observation variables are considered, which implies an even higher complexity in the probability distribution
  – In images, pixels near to each other are very likely to have a similar color or hue
  – In music, successive notes follow special laws; they are not independent, especially when they sound simultaneously. Otherwise, music would not be pleasant to the ear
  – In text, words are not an arbitrary accumulation; the order is important and grammatical constraints hold
5
[Figure: example of a labeled word sequence, with class labels Y = {name, city, 0} and X the set of words]
6
• One approach for modeling linear sequence structures, as found in natural language text, is the HMM
• For the sake of complexity reduction, strong independence assumptions between the observation variables are made
  – This impairs the accuracy of the model
[Figure: graphical model of an HMM; the lower nodes are the observation variables]
7
• Conditional Random Fields (CRFs, Lafferty et al. (2001)) were developed exactly to fill that gap:
  – While CRFs make similar assumptions on the dependencies among the class variables, no assumptions on the dependencies among the observation variables need to be made
[Figure: graphical model of a linear-chain CRF; the lower nodes are the observation variables]
8
• In natural language processing, CRFs are currently a
state-of-the-art technique for many of its subtasks
including
– basic text segmentation (Tomanek et al., 2007)
– part-of-speech tagging (Lafferty et al., 2001)
– shallow parsing (Sha and Pereira, 2003)
– the resolution of elliptical noun phrases (Buyko et al., 2007)
– named entity recognition (Settles, 2004; McDonald and Pereira, 2005; Klinger et al., 2007a,b; McDonald et al., 2004)
– gene prediction (DeCaprio et al., 2007)
– image labeling (He et al., 2004)
– object recognition (Quattoni et al., 2005)
– telematics for intrusion detection (Gupta et al., 2007)
– sensor data management (Zhang et al., 2007)
9
[Figure 1: Overview of probabilistic models; the figure also includes the MEMM]
10
Abstract
• Introduction
• Naïve Bayes
• HMM
• ME
• CRF
11
Naïve Bayes
• A conditional probability model is a probability distribution $p(y \mid \vec{x})$ with an input vector $\vec{x} = (x_1, \ldots, x_m)$, where the $x_i$ are features and $y$ is the class variable to be predicted. That probability can be formulated with Bayes' law

  (1)  $p(y \mid \vec{x}) = \dfrac{p(y)\, p(\vec{x} \mid y)}{p(\vec{x})}$

  – e.g. y: tag, x: word (not used here)
12
• The denominator $p(\vec{x})$ is constant for all y, so

  (2)  $p(y \mid \vec{x}) \propto p(y, \vec{x}) = p(y)\, p(\vec{x} \mid y)$

• Complex: model probabilities are conditioned on a varying number of variables (chain rule)

  $p(x_1, \ldots, x_m) = p(x_1)\, p(x_2 \mid x_1) \cdots p(x_m \mid x_{m-1}, \ldots, x_1) = p(x_1) \prod_{i=2}^{m} p(x_i \mid x_{i-1}, \ldots, x_1)$

  (3)  $p(y, \vec{x}) = p(y)\, p(x_1 \mid y)\, p(x_2 \mid x_1, y) \cdots p(x_m \mid x_{m-1}, \ldots, x_1, y) = p(y)\, p(x_1 \mid y) \prod_{i=2}^{m} p(x_i \mid x_{i-1}, \ldots, x_1, y)$
13
• In practice, it is often assumed that all input variables are conditionally independent of each other, which is known as the Naïve Bayes assumption:

  $p(x_i, x_j \mid y) = p(x_i \mid y)\, p(x_j \mid y)$ holds for all $i \neq j$

  $p(x_i \mid y, x_j) = \dfrac{p(y, x_i, x_j)}{p(y, x_j)} = \dfrac{p(y)\, p(x_i, x_j \mid y)}{p(y)\, p(x_j \mid y)} = \dfrac{p(y)\, p(x_j \mid y)\, p(x_i \mid y)}{p(y)\, p(x_j \mid y)} = p(x_i \mid y)$
14
  $p(x_3 \mid y, x_2, x_1) = \dfrac{p(y, x_3, x_2, x_1)}{p(y, x_2, x_1)} = \dfrac{p(y)\, p(x_3, x_2, x_1 \mid y)}{p(y)\, p(x_2, x_1 \mid y)} = \dfrac{p(y)\, p(x_1 \mid y)\, p(x_2 \mid y)\, p(x_3 \mid y)}{p(y)\, p(x_1 \mid y)\, p(x_2 \mid y)} = p(x_3 \mid y)$
15
• Simple: model probabilities are conditioned only on y

  (4)  $p(y \mid \vec{x}) \propto p(y, \vec{x}) = p(y)\, p(x_1 \mid y)\, p(x_2 \mid x_1, y) \cdots p(x_m \mid x_{m-1}, \ldots, x_1, y) \approx p(y)\, p(x_1 \mid y)\, p(x_2 \mid y) \cdots p(x_m \mid y) = p(y) \prod_{i=1}^{m} p(x_i \mid y)$

• This is less complex than (3):

  (3)  $p(y, \vec{x}) = p(y)\, p(x_1 \mid y) \prod_{i=2}^{m} p(x_i \mid x_{i-1}, \ldots, x_1, y)$
16
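To make the decision rule of equation 4 concrete, here is a minimal Python sketch that scores each class with p(y) · Π p(x_i|y) in log space. All probability tables are invented toy values, not taken from the slides.

```python
import math

# Toy sketch of the Naive Bayes decision rule p(y | x) ∝ p(y) * Π_i p(x_i | y).
# All probability tables below are invented for illustration.

prior = {"good": 0.6, "bad": 0.4}                       # p(y)
likelihood = {                                           # p(x_i | y)
    "good": {"sunny": 0.7, "cloudy": 0.2, "rainy": 0.1},
    "bad":  {"sunny": 0.1, "cloudy": 0.4, "rainy": 0.5},
}

def naive_bayes_score(y, features):
    """Unnormalized log p(y, x) under the Naive Bayes assumption."""
    score = math.log(prior[y])
    for x_i in features:
        score += math.log(likelihood[y][x_i])
    return score

features = ["cloudy", "rainy"]
best = max(prior, key=lambda y: naive_bayes_score(y, features))
print(best)  # "bad" for this toy example
```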
• Dependencies between the input variables are
not modeled, probably leading to an imperfect
representation of the real world
• Nevertheless, the Naïve Bayes Model performs
surprisingly well in many real world applications,
such as email classification (Androutsopoulos et
al., 2000a,b; Kiritchenko and Matwin, 2001)
17
Abstract
• Introduction
• Naïve Bayes
• HMM
• ME
• CRF
18
HMM
• In the Naïve Bayes Model, only single output variables have been considered
• To predict a sequence of class variables $\vec{y} = (y_1, \ldots, y_n)$ for an observation sequence $\vec{x} = (x_1, \ldots, x_n)$, a simple sequence model can be formulated as a product over single Naïve Bayes Models
19
• Note that, in contrast to the Naïve Bayes Model, there is only one feature at each sequence position, namely the identity of the respective observation
• Dependencies between single sequence positions are not taken into account
• Each observation $x_i$ depends only on the class variable $y_i$ at the respective sequence position

  (5)  $p(\vec{y}, \vec{x}) = \prod_{i=1}^{n} p(y_i)\, p(x_i \mid y_i)$

[Figure: graphical model of this simple sequence model, a chain of independent Naïve Bayes classifiers]
20
• It is reasonable to assume that there are dependencies between the observations at consecutive sequence positions
• Starting from (5) and keeping the conditional independence of the observations given the respective class variables (6), introducing transition probabilities $p(y_i \mid y_{i-1})$ leads to the first-order Hidden Markov Model (7):

  $p(\vec{y}, \vec{x}) = \prod_{i=1}^{n} p(y_i \mid y_{i-1})\, p(x_i \mid y_i)$
21
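A minimal Python sketch of this HMM joint probability, assuming an artificial START state for y_0. The transition and emission tables and the tag set are invented toy values.

```python
# Sketch of the first-order HMM joint probability
# p(y_1..y_n, x_1..x_n) = Π_i p(y_i | y_{i-1}) * p(x_i | y_i),
# with an artificial START state for y_0. All tables are invented toy values.

transition = {                      # p(y_i | y_{i-1})
    "START": {"name": 0.3, "other": 0.7},
    "name":  {"name": 0.4, "other": 0.6},
    "other": {"name": 0.2, "other": 0.8},
}
emission = {                        # p(x_i | y_i)
    "name":  {"John": 0.5, "lives": 0.0, "here": 0.0, "Mary": 0.5},
    "other": {"John": 0.1, "lives": 0.4, "here": 0.4, "Mary": 0.1},
}

def hmm_joint(tags, words):
    prob = 1.0
    prev = "START"
    for y, x in zip(tags, words):
        prob *= transition[prev][y] * emission[y][x]
        prev = y
    return prob

print(hmm_joint(["name", "other", "other"], ["John", "lives", "here"]))
```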
• Dependencies between the output variables $\vec{y}$ are modeled. A shortcoming is the assumption of conditional independence (see equation 6) between the input variables $\vec{x}$, made due to complexity issues
• As we will see later, CRFs address exactly this problem
22
Abstract
• Introduction
• Naïve Bayes
• HMM
• ME
• CRF
23
ME (Maximum Entropy)
• The previous two models are trained to maximize
the joint likelihood
• In the following, the Maximum Entropy Model is
discussed in more detail as it is fundamentally
related to CRFs
24
• The Maximum Entropy Model is a conditional
probability model
• It is based on the Principle of Maximum Entropy
(Jaynes, 1957) which states that if incomplete
information about a probability distribution is
available, the only unbiased assumption that can
be made is a distribution which is as uniform as
possible given the available information
25
• Under this assumption, the proper probability
distribution is the one which maximizes the
entropy given the constraints from the training
material
• For the conditional model p(y|x) the conditional
entropy H(y|x) (Korn and Korn, 2000; Bishop,
2006) is applied, which is defined as
  (8)  $H(y \mid \vec{x}) = -\sum_{\vec{x} \in X} \sum_{y \in Y} p(\vec{x})\, p(y \mid \vec{x}) \log p(y \mid \vec{x})$
26
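A small Python sketch of this conditional entropy, written with the empirical p̃(x) in place of p(x), as in the approximation used later (equation 20). All numbers are invented.

```python
import math

# Sketch of the conditional entropy H(y|x) = -Σ_x p̃(x) Σ_y p(y|x) log p(y|x)
# for a toy model distribution (all numbers invented).

p_tilde_x = {"cloudy": 0.5, "sunny": 0.5}                       # empirical p̃(x)
p_y_given_x = {                                                 # model p(y|x)
    "cloudy": {"good": 0.3, "bad": 0.7},
    "sunny":  {"good": 0.9, "bad": 0.1},
}

def conditional_entropy(p_x, p_y_x):
    h = 0.0
    for x, px in p_x.items():
        for y, pyx in p_y_x[x].items():
            if pyx > 0:                 # terms with p(y|x) = 0 contribute 0 by convention
                h -= px * pyx * math.log(pyx)
    return h

print(conditional_entropy(p_tilde_x, p_y_given_x))
```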
• The basic idea behind Maximum Entropy Models is to find the model $p^*(y \mid \vec{x})$ which on the one hand has the largest possible conditional entropy, but on the other hand is still consistent with the information from the training material
• The objective function, later referred to as the primal problem, is thus

  (9)  $p^*(y \mid \vec{x}) = \arg\max_{p(y \mid \vec{x})} H(y \mid \vec{x})$, subject to the constraints given below
27
• The training material is represented by features
• Here, these are defined as binary-valued functions $f_i(\vec{x}, y)$ which depend on both the input variable $\vec{x}$ and the class variable $y$. An example of such a function is
(10)
28
• The expected value of each feature $f_i$ is estimated from the empirical distribution $\tilde{p}(\vec{x}, y)$
• The empirical distribution is obtained by simply counting how often the different values of the variables occur in the training data

  (11)  $\tilde{E}(f_i) = \sum_{\vec{x} \in X} \sum_{y \in Y} \tilde{p}(\vec{x}, y)\, f_i(\vec{x}, y)$
29
  (11)  $\tilde{E}(f_i) = \sum_{\vec{x} \in X} \sum_{y \in Y} \tilde{p}(\vec{x}, y)\, f_i(\vec{x}, y)$

• As the empirical probability of a pair $(\vec{x}, y)$ which is not contained in the training material is 0, equation 11 can be rewritten as

  (12)  $\tilde{E}(f_i) = \dfrac{1}{N} \sum_{(\vec{x}, y) \in T} f_i(\vec{x}, y)$

  – $T$ denotes the training set; its size is $N$
  – $\tilde{E}(f_i)$ can be calculated by counting how often a feature $f_i$ is found with value 1 in the training data and dividing that number by the size $N$ of the training set
30
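A minimal Python sketch of equation 12: count how often a feature fires in the training data and divide by N. The training pairs and the feature are invented for illustration.

```python
# Sketch of the empirical expectation Ẽ(f_i) from equation 12:
# count how often feature f_i fires in the training data and divide by N.
# The training pairs and the feature are invented for illustration.

training_data = [                         # pairs (x, y)
    ("cloudy", "bad"),
    ("cloudy", "good"),
    ("sunny", "good"),
    ("rainy", "bad"),
]

def f_1(x, y):
    """Fires when the observation is 'cloudy' and the class is 'bad'."""
    return 1 if (x == "cloudy" and y == "bad") else 0

def empirical_expectation(f, data):
    return sum(f(x, y) for x, y in data) / len(data)

print(empirical_expectation(f_1, training_data))  # 1/4 = 0.25
```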
  (11)  $\tilde{E}(f_i) = \sum_{\vec{x} \in X} \sum_{y \in Y} \tilde{p}(\vec{x}, y)\, f_i(\vec{x}, y)$

• Analogously to equation 11, the expected value of a feature on the model distribution is defined as

  (13)  $E(f_i) = \sum_{\vec{x} \in X} \sum_{y \in Y} p(\vec{x}, y)\, f_i(\vec{x}, y)$

• In contrast to equation 11 (the expected value on the empirical distribution), the model distribution is taken into account here
• Of course, $p(\vec{x}, y) = p(\vec{x})\, p(y \mid \vec{x})$ cannot be calculated in general because the number of all possible $\vec{x} \in X$ can be enormous, so $p(\vec{x})$ is replaced by its empirical counterpart

  (14)  $p(\vec{x}) \approx \tilde{p}(\vec{x})$
31
• This is an approximation to make the calculation of $E(f_i)$ possible (see Lau et al. (1993) for a more detailed discussion). This results in

  (15)  $E(f_i) \approx \sum_{\vec{x} \in X} \sum_{y \in Y} \tilde{p}(\vec{x})\, p(y \mid \vec{x})\, f_i(\vec{x}, y) = \sum_{\vec{x} \in X} \tilde{p}(\vec{x}) \sum_{y \in Y} p(y \mid \vec{x})\, f_i(\vec{x}, y)$

• which can (analogously to equation 12) be transformed into

  (16)  $E(f_i) \approx \dfrac{1}{N} \sum_{\vec{x} \in T} \sum_{y \in Y} p(y \mid \vec{x})\, f_i(\vec{x}, y)$
32
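A minimal Python sketch of the approximated model expectation from equation 16. The stand-in for the model distribution p(y|x) and all numbers are invented for illustration.

```python
# Sketch of the approximated model expectation from equation 16:
# E(f_i) ≈ (1/N) Σ_{x in training data} Σ_y p(y|x) f_i(x, y),
# i.e. the empirical p̃(x) replaces the unknown p(x). Toy values only.

training_observations = ["cloudy", "cloudy", "sunny", "rainy"]   # the x side of the training set
labels = ["good", "bad"]

def p_y_given_x(y, x):
    """Stand-in for the current model distribution (invented numbers)."""
    table = {"cloudy": {"good": 0.3, "bad": 0.7},
             "sunny":  {"good": 0.9, "bad": 0.1},
             "rainy":  {"good": 0.2, "bad": 0.8}}
    return table[x][y]

def f_1(x, y):
    return 1 if (x == "cloudy" and y == "bad") else 0

def model_expectation(f, observations):
    n = len(observations)
    return sum(p_y_given_x(y, x) * f(x, y)
               for x in observations for y in labels) / n

print(model_expectation(f_1, training_observations))  # (0.7 + 0.7) / 4 = 0.35
```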
• Equation 9 postulates that the model p*(y|x) is consistent with the evidence found in the training material
• That means, for each feature $f_i$, its expected value on the empirical distribution must be equal to its expected value on the particular model distribution; these are the first m constraints

  (17)  $\tilde{E}(f_i) = E(f_i), \quad i = 1, \ldots, m$

• Another constraint is to have a proper conditional probability, ensured by

  (18)  $\sum_{y \in Y} p(y \mid \vec{x}) = 1 \quad \forall \vec{x}$
33
• Finding p*(y|x) under these constraints can be formulated as a constrained optimization problem
• For each constraint a Lagrange multiplier $\lambda_i$ is introduced. This leads to the following Lagrange function $\Lambda(p, \vec{\lambda})$

  (19)  $\Lambda(p, \vec{\lambda}) = H(y \mid \vec{x}) + \sum_{i=1}^{m} \lambda_i \big( E(f_i) - \tilde{E}(f_i) \big) + \lambda_{m+1} \Big( \sum_{y \in Y} p(y \mid \vec{x}) - 1 \Big)$

  – ? Note that the normalization constraint must hold for all $\vec{x}$ (see the alternative formulation with one multiplier per $\vec{x}$ below)
34
• In the same manner as done for the expectation values in equation 15, H(y|x) is approximated

  (20)  $H(y \mid \vec{x}) \approx H(Y \mid X) = -\sum_{\vec{x} \in X} \sum_{y \in Y} \tilde{p}(\vec{x})\, p(y \mid \vec{x}) \log p(y \mid \vec{x})$

        $= -\big[\, \tilde{p}(\vec{x}_1)\, p(y_1 \mid \vec{x}_1) \log p(y_1 \mid \vec{x}_1) + \ldots + \tilde{p}(\vec{x}_s)\, p(y_k \mid \vec{x}_s) \log p(y_k \mid \vec{x}_s) + \ldots + \tilde{p}(\vec{x}_l)\, p(y_n \mid \vec{x}_l) \log p(y_n \mid \vec{x}_l) \,\big]$

• Taking the partial derivative with respect to a single entry $p(y_k \mid \vec{x}_s)$ gives

  (21)  $\dfrac{\partial H(y \mid \vec{x})}{\partial p(y_k \mid \vec{x}_s)} = -\tilde{p}(\vec{x}_s) \big( \log p(y_k \mid \vec{x}_s) + 1 \big)$
35
P( X  xi )  pi , i  1 ~ l
P(Y  y j | X  xi )  pij
j 1~ n
H ( X , Y )  H l n  H l n ( p1 p11 , p1 p12 ,..., p1 p1n , p2 p21 , p2 p22 ,..., p2 p2 n , pl pl1 , pl pl 2 ,..., pl pl n )
l
l
n
n
  pi pij log pi pij   P( X  xi ) P(Y  y j | X  xi ) log P( X  xi ) P(Y  y j | X  xi )
i 1 j 1
i 1 j 1
 ...

 n

 H n ( p1 ,..., pl )   pi H n ( pi1 ,..., pin )  H ( X )   pi    pij log pij 
i 1
i 1

 j 1
l
l
l
 H ( X )   pi H (Y | X  xi )
i 1
 H ( X )  H (Y | X )
36
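A quick numeric sanity check, in Python, of the identity H(X, Y) = H(X) + H(Y|X) derived on this slide, for a small invented joint distribution.

```python
import math

# Numeric check of the chain rule H(X, Y) = H(X) + H(Y|X)
# for a small invented joint distribution p(x, y).

p_xy = {("cloudy", "good"): 0.15, ("cloudy", "bad"): 0.35,
        ("sunny", "good"): 0.40,  ("sunny", "bad"): 0.10}

def marginal_x(p):
    px = {}
    for (x, _), v in p.items():
        px[x] = px.get(x, 0.0) + v
    return px

def h_joint(p):
    return -sum(v * math.log(v) for v in p.values() if v > 0)

def h_x(p):
    return -sum(v * math.log(v) for v in marginal_x(p).values() if v > 0)

def h_y_given_x(p):
    px = marginal_x(p)
    return -sum(v * math.log(v / px[x]) for (x, _), v in p.items() if v > 0)

print(abs(h_joint(p_xy) - (h_x(p_xy) + h_y_given_x(p_xy))) < 1e-12)  # True
```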
  $H(Y \mid X) = \sum_{i=1}^{l} p_i H(Y \mid X = x_i) = \sum_{i=1}^{l} p_i \Big( -\sum_{j=1}^{n} p_{ij} \log p_{ij} \Big)$
  $= \sum_{i=1}^{l} P(X = x_i) \Big( -\sum_{j=1}^{n} P(Y = y_j \mid X = x_i) \log P(Y = y_j \mid X = x_i) \Big)$
  $= -\sum_{i=1}^{l} \sum_{j=1}^{n} P(X = x_i) P(Y = y_j \mid X = x_i) \log P(Y = y_j \mid X = x_i)$
  $= -\sum_{i=1}^{l} \sum_{j=1}^{n} p(x_i)\, p(y_j \mid x_i) \log p(y_j \mid x_i) = -\sum_{x \in X} \sum_{y \in Y} p(x)\, p(y \mid x) \log p(y \mid x)$
  $= -\sum_{(x, y) \in Z} p(x)\, p(y \mid x) \log p(y \mid x)$
37
  (22)  $\dfrac{\partial}{\partial p(y_k \mid \vec{x}_s)} \sum_{i=1}^{m} \lambda_i \big( E(f_i) - \tilde{E}(f_i) \big)$
        $= \dfrac{\partial}{\partial p(y_k \mid \vec{x}_s)} \sum_{i=1}^{m} \lambda_i \Big( \sum_{\vec{x} \in X} \sum_{y \in Y} \tilde{p}(\vec{x})\, p(y \mid \vec{x})\, f_i(\vec{x}, y) - \sum_{\vec{x} \in X} \sum_{y \in Y} \tilde{p}(\vec{x}, y)\, f_i(\vec{x}, y) \Big)$
        $= \sum_{i=1}^{m} \lambda_i\, \tilde{p}(\vec{x}_s)\, f_i(\vec{x}_s, y_k)$
38
• Derivative of the term for the normalization constraint:

  $\dfrac{\partial}{\partial p(y_k \mid \vec{x}_s)} \Big[ \lambda_{m+1} \Big( \sum_{y \in Y} p(y \mid \vec{x}) - 1 \Big) \Big] = \lambda_{m+1}$  (for the $\vec{x}_i$ with $i = s$)

• Other description: with one multiplier $\lambda_{m+i}$ per $\vec{x}_i$, $i = 1, \ldots, l$,

  $\dfrac{\partial}{\partial p(y_k \mid \vec{x}_s)} \Big[ \sum_{i=1}^{l} \lambda_{m+i} \Big( \sum_{j=1}^{n} P(Y = y_j \mid X = x_i) - 1 \Big) \Big]$
  $= \dfrac{\partial}{\partial p(y_k \mid \vec{x}_s)} \big[ \lambda_{m+1}\, p(y_1 \mid x_1) + \ldots + \lambda_{m+s}\, p(y_k \mid x_s) + \ldots + \lambda_{m+l}\, p(y_n \mid x_l) - \ldots \big]$
  $= \lambda_{m+s}$
39
• The complete derivative of the Lagrange function from equation 19 is then

  (23)  $\dfrac{\partial \Lambda}{\partial p(y \mid \vec{x})} = -\tilde{p}(\vec{x}) \big( \log p(y \mid \vec{x}) + 1 \big) + \sum_{i=1}^{m} \lambda_i\, \tilde{p}(\vec{x})\, f_i(\vec{x}, y) + \lambda_{m+1}$

• Equating this term to 0 and solving for p(y|x) leads to

  (24)  $p(y \mid \vec{x}) = \exp\Big( \sum_{i=1}^{m} \lambda_i f_i(\vec{x}, y) \Big) \cdot \exp\Big( \dfrac{\lambda_{m+1}}{\tilde{p}(\vec{x})} - 1 \Big)$
40
• The second constraint in equation 18 is given as

  (25)  $\sum_{y \in Y} p(y \mid \vec{x}) = 1$

• Substituting equation 24 into 25 results in

  (26)  $\exp\Big( \dfrac{\lambda_{m+1}}{\tilde{p}(\vec{x})} - 1 \Big) = \dfrac{1}{\sum_{y \in Y} \exp\big( \sum_{i=1}^{m} \lambda_i f_i(\vec{x}, y) \big)}$

• Substituting equation 26 back into equation 24 results in

  (27)  $p(y \mid \vec{x}) = \dfrac{\exp\big( \sum_{i=1}^{m} \lambda_i f_i(\vec{x}, y) \big)}{\sum_{y' \in Y} \exp\big( \sum_{i=1}^{m} \lambda_i f_i(\vec{x}, y') \big)}$
41
• This is the general form the model needs to have to
meet the constraints. The Maximum Entropy Model
can then be formulated as
  (28)  $p^*(y \mid \vec{x}) = \dfrac{1}{Z_\lambda(\vec{x})} \exp\Big( \sum_{i=1}^{m} \lambda_i f_i(\vec{x}, y) \Big)$

  (29)  $Z_\lambda(\vec{x}) = \sum_{y \in Y} \exp\Big( \sum_{i=1}^{m} \lambda_i f_i(\vec{x}, y) \Big)$
• This formulation of a conditional probability
distribution as a log-linear model and a product of
exponentiated weighted features is discussed from
another perspective in Section 3
• In Section 4, the similarity of Conditional Random
Fields, which are also log-linear models, with the
conceptually closely related Maximum Entropy
Models becomes evident
42
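A minimal Python sketch of the resulting log-linear model (equations 28 and 29): a normalized product of exponentiated weighted features. The feature functions and weights are invented; in practice the λ_i would be learned from data.

```python
import math

# Sketch of the Maximum Entropy model p(y|x) = exp(Σ_i λ_i f_i(x, y)) / Z(x),
# with Z(x) = Σ_y' exp(Σ_i λ_i f_i(x, y')). Features and weights are invented.

labels = ["good", "bad"]

def features(x, y):
    """Binary feature vector f(x, y) for this toy example."""
    return [
        1 if (x == "cloudy" and y == "bad") else 0,
        1 if (x == "sunny" and y == "good") else 0,
    ]

lambdas = [1.5, 2.0]   # weights λ_i, normally learned from training data

def p_y_given_x(y, x):
    def score(label):
        return sum(l * f for l, f in zip(lambdas, features(x, label)))
    z = sum(math.exp(score(label)) for label in labels)   # normalizer Z(x)
    return math.exp(score(y)) / z

print(p_y_given_x("bad", "cloudy"))   # ≈ 0.82 for these toy weights
```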
Lagrange Multiplier Method
Prof. 張海潮, Department of Mathematics, National Taiwan University
http://episte.math.ntu.edu.tw/entries/en_lagrange_mul/index.html
Presented by Jian-Shiun Tzeng 4/24/2009
43
• In the two-variable case, a necessary condition for f(x, y) to attain an extremum is

  (L1)  $\dfrac{\partial f}{\partial x} = \dfrac{\partial f}{\partial y} = 0$

• But if x and y are constrained from the start by another function g(x, y) = 0, Lagrange proposed to replace (L1) by the condition that the partial derivatives of $f(x, y) + \lambda\, g(x, y)$ with respect to x and y are 0, as the condition for finding extrema of f(x, y) on g(x, y) = 0
• The λ introduced here is a number to be determined; it is called the multiplier because it is multiplied in front of g.
44
• First, note that we solve for three unknowns x, y, and λ, with

  $\dfrac{\partial f}{\partial x} + \lambda \dfrac{\partial g}{\partial x} = 0$
  $\dfrac{\partial f}{\partial y} + \lambda \dfrac{\partial g}{\partial y} = 0$
  $g(x, y) = 0$

• Although there are three equations, in principle they can be solved.
45
• Take f(x, y) = x and g(x, y) = x²+y²−1 as an example. When x and y are restricted to move on x²+y²−1 = 0, we solve the following three equations:

  $\dfrac{\partial}{\partial x}\big[ x + \lambda (x^2 + y^2 - 1) \big] = 1 + 2\lambda x = 0$
  $\dfrac{\partial}{\partial y}\big[ x + \lambda (x^2 + y^2 - 1) \big] = 2\lambda y = 0$
  $x^2 + y^2 - 1 = 0$

• There are two solutions: x = 1, y = 0, λ = −1/2 and x = −1, y = 0, λ = 1/2. They correspond to the left and right endpoints of the circle x²+y²−1 = 0. Their x-coordinates are 1 and −1; one is the maximum and the other the minimum.
46
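For readers who want to verify this example, here is a short sympy sketch that solves the three equations above (assuming the sympy package is available).

```python
import sympy as sp

# Solving the three Lagrange equations of the circle example with sympy.

x, y, lam = sp.symbols("x y lam", real=True)

solutions = sp.solve(
    [1 + 2 * lam * x,        # ∂/∂x [x + λ(x²+y²−1)] = 0
     2 * lam * y,            # ∂/∂y [x + λ(x²+y²−1)] = 0
     x**2 + y**2 - 1],       # constraint g(x, y) = 0
    [x, y, lam], dict=True)

print(solutions)
# two solutions: x = 1, y = 0, lam = -1/2  and  x = -1, y = 0, lam = 1/2
```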
• Readers may wonder: why not rewrite the constraint x²+y²−1 = 0 as x = cos θ, y = sin θ, substitute to obtain f(x, y) = f(cos θ, sin θ), and then set the derivative with respect to θ to 0? For the example above this certainly works, but when g(x, y) has a rather general form that cannot be captured by a parameterization of x and y, or when there are more variables with more constraints, the old parameterization method usually fails (Note 1).
47
Note 1
• To find the extrema of f(x, y, z) under the constraints g₁(x, y, z) = 0 and g₂(x, y, z) = 0, the Lagrange multiplier method requires the following five equations:

  $\dfrac{\partial}{\partial x}\big[ f + \lambda_1 g_1 + \lambda_2 g_2 \big] = 0$
  $\dfrac{\partial}{\partial y}\big[ f + \lambda_1 g_1 + \lambda_2 g_2 \big] = 0$
  $\dfrac{\partial}{\partial z}\big[ f + \lambda_1 g_1 + \lambda_2 g_2 \big] = 0$
  $g_1 = 0$
  $g_2 = 0$

• There are five unknowns to solve for: x, y, z and λ₁, λ₂.
48
• What is the meaning of this method? When g(x, y) = 0, we may regard y as an implicit function of x, so that g(x, y(x)) = 0 and f(x, y) becomes f(x, y(x)). Setting $\dfrac{d}{dx} f(x, y(x)) = 0$ and applying the chain rule, we obtain

  $f_x + f_y \dfrac{dy}{dx} = 0$
  $g_x + g_y \dfrac{dy}{dx} = 0$   (from $\dfrac{d}{dx}\, g(x, y(x)) = 0$)

• Hence the determinant is 0:

  $\begin{vmatrix} f_x & f_y \\ g_x & g_y \end{vmatrix} = 0$
49
• This means that (f_x, f_y) and (g_x, g_y) are proportional, so there exists a λ (Note 2) with

  $f_x = \lambda g_x$
  $f_y = \lambda g_y$
50
Note 2
• Of course, this can also be written as $f_x + \lambda g_x = 0$, $f_y + \lambda g_y = 0$. Note, however, the precondition that g_x and g_y must not both be 0; if both are 0, y and x cannot be expressed as implicit functions of each other. Consider the following example: let f(x, y) = x and g(x, y) = y²−x³. Then

  $\dfrac{\partial}{\partial x}\big[ x + \lambda (y^2 - x^3) \big] = 1 - 3\lambda x^2 = 0$
  $\dfrac{\partial}{\partial y}\big[ x + \lambda (y^2 - x^3) \big] = 2\lambda y = 0$
  $y^2 - x^3 = 0$

• It is easy to see that this system has no solution, yet f(x, y) does have an extremum: it occurs at (0, 0). However, at (0, 0), g_x = 0 = g_y, so the multiplier method fails.
51
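A short sympy check confirming that this system indeed has no solution (assuming the sympy package is available).

```python
import sympy as sp

# Checking with sympy that the system in this note has no solution:
# 1 - 3λx² = 0, 2λy = 0, y² - x³ = 0.

x, y, lam = sp.symbols("x y lam", real=True)

solutions = sp.solve(
    [1 - 3 * lam * x**2,
     2 * lam * y,
     y**2 - x**3],
    [x, y, lam], dict=True)

print(solutions)  # [] — no solution; the multiplier method fails at (0, 0)
```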
Abstract
• Introduction
• Naïve Bayes
• HMM
• ME
• CRF
52