
Web-Mining Agents
Prof. Dr. Ralf Möller
Universität zu Lübeck
Institut für Informationssysteme
Tanya Braun (Exercises)
Discriminative Training
Markov Networks and
Conditional Random Fields (CRFs)
Raymond J. Mooney
University of Texas at Austin
Bayesian Categorization
• Determine the category of x_k by determining, for each y_i:
$$P(Y=y_i \mid X=x_k) = \frac{P(Y=y_i)\,P(X=x_k \mid Y=y_i)}{P(X=x_k)} = \alpha\,P(Y=y_i)\,P(X=x_k \mid Y=y_i)$$
• P(X=x_k) can be determined since the categories are complete and disjoint:
$$\sum_{i=1}^{m} P(Y=y_i \mid X=x_k) = \sum_{i=1}^{m} \frac{P(Y=y_i)\,P(X=x_k \mid Y=y_i)}{P(X=x_k)} = 1$$
$$P(X=x_k) = \sum_{i=1}^{m} P(Y=y_i)\,P(X=x_k \mid Y=y_i) = Z, \qquad \alpha = 1/Z$$
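A minimal Python sketch of this renormalization; the priors and likelihoods below are made-up numbers for illustration only:

```python
# P(Y=y_i | X=x_k) is obtained by renormalizing P(Y=y_i) * P(X=x_k | Y=y_i)
# over all categories; alpha = 1/Z.
priors = {"pos": 0.6, "neg": 0.4}            # P(Y = y_i), illustrative
likelihood = {"pos": 0.02, "neg": 0.005}     # P(X = x_k | Y = y_i) for one fixed x_k, illustrative

unnormalized = {y: priors[y] * likelihood[y] for y in priors}
Z = sum(unnormalized.values())               # = P(X = x_k), since categories are complete and disjoint
posterior = {y: v / Z for y, v in unnormalized.items()}

print(posterior)   # e.g. {'pos': 0.857..., 'neg': 0.142...}
```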
Bayesian Categorization (cont.)
• Need to know:
– Priors: P(Y=yi)
– Conditionals: P(X=xk | Y=yi)
• P(Y=yi) are easily estimated from data.
– If ni of the examples in D are in yi then P(Y=yi) = ni / |D|
• Too many possible instances (e.g., $2^n$ for n binary
features) to estimate all P(X=x_k | Y=y_i).
• Still need to make some sort of independence
assumptions about the features to make learning
tractable.
Naïve Bayes Generative Model
[Figure: a generative model. A Category node (pos or neg) selects a bag of labeled examples; Size (sm, med, lg), Color (red, blue, grn), and Shape (circ, tri, sqr) values are then drawn from the Positive or Negative bag accordingly.]
Naïve Bayes Inference Problem
[Figure: the same generative model used for inference. Given a test instance with Size = lg, Color = red, Shape = circ, which Category ("??") is most likely to have generated it, pos or neg?]
Naïve Bayesian Categorization
• If we assume the features of an instance are conditionally independent given the category:
$$P(X \mid Y) = P(X_1, X_2, \ldots, X_n \mid Y) = \prod_{i=1}^{n} P(X_i \mid Y)$$
• Therefore, we only need to know P(X_i | Y) for each possible pair of a feature value and a category.
• If Y and all X_i are binary, this requires specifying only 2n parameters:
– P(X_i=true | Y=true) and P(X_i=true | Y=false) for each X_i
– P(X_i=false | Y) = 1 – P(X_i=true | Y)
• Compared to specifying $2^n$ parameters without any independence assumptions.
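A minimal Python sketch of Naïve Bayes classification under this conditional-independence assumption, with made-up probabilities for three hypothetical binary features:

```python
# P(Y | X) is proportional to P(Y) * prod_i P(X_i | Y); all numbers are illustrative.
import math

priors = {"pos": 0.5, "neg": 0.5}
# P(X_i = true | Y) for three hypothetical binary features
p_true = {"pos": [0.8, 0.6, 0.7],
          "neg": [0.3, 0.4, 0.2]}

def naive_bayes_posterior(x):
    """x is a list of booleans; returns the normalized P(Y | X)."""
    scores = {}
    for y in priors:
        log_score = math.log(priors[y])
        for xi, p in zip(x, p_true[y]):
            log_score += math.log(p if xi else 1.0 - p)
        scores[y] = math.exp(log_score)
    Z = sum(scores.values())
    return {y: s / Z for y, s in scores.items()}

print(naive_bayes_posterior([True, False, True]))
```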
Generative vs. Discriminative Models
• Generative models are not directly designed to maximize the performance of classification. They model the complete joint distribution P(X, Y).
• Classification is then done using Bayesian inference given the generative model of the joint distribution.
• But a generative model can also be used to perform any other inference task, e.g., P(X_1 | X_2, …, X_n, Y).
– “Jack of all trades, master of none.”
• Discriminative models are specifically designed and trained
to maximize performance of classification. They only model
the conditional distribution P(Y | X).
• By focusing on modeling the conditional distribution, they
generally perform better on classification than generative
models when given a reasonable amount of training data.
Logistic Regression (for discrete vars)
• Assumes a parametric form for directly estimating
P(Y | X). For binary concepts, this is:
$$P(Y=1 \mid X) = \frac{1}{1 + \exp\!\big(w_0 + \sum_{i=1}^{n} w_i X_i\big)}$$
$$P(Y=0 \mid X) = 1 - P(Y=1 \mid X) = \frac{\exp\!\big(w_0 + \sum_{i=1}^{n} w_i X_i\big)}{1 + \exp\!\big(w_0 + \sum_{i=1}^{n} w_i X_i\big)}$$
• Equivalent to a one-layer backpropagation neural net.
– Logistic regression is the source of the sigmoid function
used in backpropagation.
– Objective function for training is somewhat different.
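A minimal Python sketch of this parametric form, with arbitrary illustrative weights w0 and w:

```python
# P(Y=1|X) = 1 / (1 + exp(w0 + sum_i w_i * x_i)), matching the parameterization above.
import math

w0 = -1.0                 # bias weight (illustrative)
w = [0.5, 2.0, -0.3]      # one weight per feature X_i (illustrative)

def p_y1_given_x(x):
    z = w0 + sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(z))

x = [1.0, 0.0, 1.0]
print(p_y1_given_x(x), 1.0 - p_y1_given_x(x))   # P(Y=1|X), P(Y=0|X)
```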
Logistic Regression as a Log-Linear Model
• Logistic regression is basically a linear model, which
is demonstrated by taking logs.
$$\frac{P(Y=0 \mid X)}{P(Y=1 \mid X)} = \exp\!\Big(w_0 + \sum_{i=1}^{n} w_i X_i\Big)$$
Assign label Y = 0 iff
$$1 < \exp\!\Big(w_0 + \sum_{i=1}^{n} w_i X_i\Big), \quad \text{i.e.,} \quad 0 < w_0 + \sum_{i=1}^{n} w_i X_i, \quad \text{or equivalently} \quad -w_0 < \sum_{i=1}^{n} w_i X_i$$
n
• Also called a maximum entropy model (MaxEnt)
because it can be shown that standard training for
logistic regression gives the distribution with maximum
entropy that is consistent with the training data.
Logistic Regression Training
• Weights are set during training to maximize the conditional data likelihood:
$$W = \operatorname*{argmax}_{W} \prod_{d \in D} P(Y^d \mid X^d, W)$$
where D is the set of training examples and Y^d and X^d denote, respectively, the values of Y and X for example d.
• Equivalently viewed as maximizing the conditional log likelihood (CLL):
$$W = \operatorname*{argmax}_{W} \sum_{d \in D} \ln P(Y^d \mid X^d, W)$$
Logistic Regression Training
• Like neural-nets, can use standard gradient
descent to find the parameters (weights) that
optimize the CLL objective function.
• Many other more advanced training
methods are possible to speed up
convergence.
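A minimal Python sketch of batch gradient ascent on the CLL for the logistic regression form above, using a tiny made-up data set and a hypothetical learning rate eta:

```python
# Gradient ascent on the CLL; the parameterization is P(Y=1|X) = 1/(1 + exp(w0 + sum_i w_i x_i)).
# With that convention, dCLL/dw_i = sum_d (P(Y=1|x_d) - y_d) * x_{d,i}.
import math

def p_y1(w0, w, x):
    s = w0 + sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(s))

# toy training set: (features, label); purely illustrative
data = [([1.0, 0.0], 1), ([0.9, 0.1], 1), ([0.1, 1.0], 0), ([0.2, 0.8], 0)]

w0, w = 0.0, [0.0, 0.0]
eta = 0.5                                   # learning rate (illustrative)
for _ in range(1000):                       # batch gradient ascent on the CLL
    g0, g = 0.0, [0.0, 0.0]
    for x, y in data:
        err = p_y1(w0, w, x) - y            # gradient of the CLL w.r.t. the score
        g0 += err
        for i, xi in enumerate(x):
            g[i] += err * xi
    w0 += eta * g0
    w = [wi + eta * gi for wi, gi in zip(w, g)]

print(w0, w, [round(p_y1(w0, w, x), 3) for x, _ in data])
```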
Preventing Overfitting in Logistic Regression
• To prevent overfitting, one can use regularization (a.k.a. smoothing), penalizing large weights by changing the training objective:
$$W = \operatorname*{argmax}_{W} \sum_{d \in D} \ln P(Y^d \mid X^d, W) \;-\; \frac{\lambda}{2}\lVert W \rVert^2$$
where λ is a constant that determines the amount of smoothing.
• This can be shown to be equivalent to MAP parameter estimation assuming a Gaussian prior for W with zero mean and a variance related to 1/λ.
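A minimal sketch of the regularized objective itself, with a hypothetical smoothing constant lam; leaving the bias weight w0 unpenalized is a common but not mandatory choice:

```python
# Regularized training objective: CLL minus (lambda/2) * ||W||^2. All numbers are illustrative.
import math

def regularized_cll(w0, w, data, lam):
    cll = 0.0
    for x, y in data:
        s = w0 + sum(wi * xi for wi, xi in zip(w, x))
        p1 = 1.0 / (1.0 + math.exp(s))
        cll += math.log(p1 if y == 1 else 1.0 - p1)
    penalty = 0.5 * lam * sum(wi * wi for wi in w)   # bias w0 left unpenalized here
    return cll - penalty

data = [([1.0, 0.0], 1), ([0.1, 1.0], 0)]
print(regularized_cll(0.0, [0.5, -0.5], data, lam=0.1))
```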
Multinomial Logistic Regression
(MaxEnt)
• Logistic regression can be generalized to multi-class
problems (where Y has a multinomial distribution).
• Create a feature function for each combination of a class value y′ and each feature X_j, and another for the "bias weight" of each class:
– f_{y′,j}(Y, X) = X_j if Y = y′ and 0 otherwise
– f_{y′}(Y, X) = 1 if Y = y′ and 0 otherwise
• The final conditional distribution is:
$$P(Y \mid X) = \frac{1}{Z(X)} \exp\!\Big(\sum_{k=1}^{K} \lambda_k f_k(Y, X)\Big) \qquad (\lambda_k \text{ are weights})$$
$$Z(X) = \sum_{Y} \exp\!\Big(\sum_{k=1}^{K} \lambda_k f_k(Y, X)\Big) \qquad (\text{normalizing constant})$$
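A minimal Python sketch of this MaxEnt distribution with three hypothetical classes, two features, and made-up weights; the bias feature of a class is stored under the key (class, None):

```python
# P(Y|X) = exp(sum_k lambda_k f_k(Y,X)) / Z(X), with indicator-style feature functions.
import math

classes = ["noun", "verb", "other"]
n_features = 2

# one weight per (class, feature) pair plus one bias weight per class (all illustrative)
weights = {
    ("noun", 0): 1.2, ("noun", 1): -0.4, ("noun", None): 0.1,
    ("verb", 0): -0.3, ("verb", 1): 0.9, ("verb", None): 0.0,
    ("other", 0): 0.0, ("other", 1): 0.0, ("other", None): -0.2,
}

def p_y_given_x(x):
    scores = {}
    for y in classes:
        s = weights[(y, None)]                                          # bias feature f_{y'}(Y,X)
        s += sum(weights[(y, j)] * x[j] for j in range(n_features))     # f_{y',j}(Y,X) = X_j
        scores[y] = math.exp(s)
    Z = sum(scores.values())                                            # Z(X): sum over all classes
    return {y: s / Z for y, s in scores.items()}

print(p_y_given_x([1.0, 0.0]))
```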
Graphical Models
• If no assumption of independence is made, then an
exponential number of parameters must be estimated for
sound probabilistic inference.
– No realistic amount of training data is sufficient to estimate so many
parameters.
• If a blanket assumption of conditional independence is made,
efficient training and inference is possible, but such a strong
assumption is rarely warranted.
• Graphical models use directed or undirected graphs over a
set of random variables to explicitly specify variable
dependencies and allow for less restrictive independence
assumptions while limiting the number of parameters that
must be estimated.
– Bayesian Networks: Directed acyclic graphs that indicate causal
structure.
– Markov Networks: Undirected graphs that capture general
dependencies.
Bayesian Networks
• Directed Acyclic Graph (DAG)
– Nodes are random variables
– Edges indicate causal influences
[Figure: the burglary network, with edges Burglary → Alarm, Earthquake → Alarm, Alarm → JohnCalls, Alarm → MaryCalls.]
Conditional Probability Tables
• Each node has a conditional probability table (CPT) that
gives the probability of each of its values given every possible
combination of values for its parents (conditioning case).
– Roots (sources) of the DAG that have no parents are given prior
probabilities.
[Figure: the burglary network annotated with its CPTs:]

P(B) = .001        P(E) = .002

B  E | P(A)        A | P(J)        A | P(M)
T  T | .95         T | .90         T | .70
T  F | .94         F | .05         F | .01
F  T | .29
F  F | .001
Joint Distributions for Bayes Nets
• A Bayesian Network implicitly defines a joint
distribution.
$$P(x_1, x_2, \ldots, x_n) = \prod_{i=1}^{n} P(x_i \mid \mathrm{Parents}(X_i))$$
• Example:
$$P(J \wedge M \wedge A \wedge \neg B \wedge \neg E) = P(J \mid A)\,P(M \mid A)\,P(A \mid \neg B \wedge \neg E)\,P(\neg B)\,P(\neg E)$$
$$= 0.9 \times 0.7 \times 0.001 \times 0.999 \times 0.998 \approx 0.00062$$
Naïve Bayes as a Bayes Net
• Naïve Bayes is a simple Bayes Net
[Figure: a Bayes net with class node Y as the single parent of feature nodes X1, X2, …, Xn.]
• Priors P(Y) and conditionals P(Xi|Y) for
Naïve Bayes provide CPTs for the network.
Markov Networks
• Undirected graph over a set of random
variables, where an edge represents a
dependency.
• The Markov blanket of a node, X, in a
Markov Net is the set of its neighbors in the
graph (nodes that have an edge connecting
to X).
• Every node in a Markov Net is
conditionally independent of every other
node given its Markov blanket.
Distribution for a Markov Network
• The distribution of a Markov net is most compactly described
in terms of a set of potential functions (a.k.a. factors,
compatibility functions), φk, for each clique, k, in the graph.
• For each joint assignment of values to the variables in clique
k, φk assigns a non-negative real value that represents the
compatibility of these values.
• The joint distribution of a Markov network is then defined by:
$$P(x_1, x_2, \ldots, x_n) = \frac{1}{Z}\prod_{k}\phi_k(x_{\{k\}})$$
where $x_{\{k\}}$ represents the joint assignment of the variables in clique k, and Z is a normalizing constant that makes the joint distribution sum to 1:
$$Z = \sum_{x}\prod_{k}\phi_k(x_{\{k\}})$$
Sample Markov Network
[Figure: the burglary network as a Markov net, with a potential table on each edge (clique):]

B  A | φ1(B,A)        E  A | φ2(E,A)
T  T | 100            T  T | 50
T  F | 1              T  F | 10
F  T | 1              F  T | 1
F  F | 200            F  F | 200

J  A | φ3(J,A)        M  A | φ4(M,A)
T  T | 75             T  T | 50
T  F | 10             T  F | 1
F  T | 1              F  T | 10
F  F | 200            F  F | 200
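A minimal Python sketch that uses these potential tables to compute the partition function Z by brute-force enumeration over the five binary variables, and then one joint probability:

```python
# P(b,e,a,j,m) = (1/Z) * phi1(b,a) * phi2(e,a) * phi3(j,a) * phi4(m,a), with the values above.
from itertools import product

phi1 = {(True, True): 100, (True, False): 1, (False, True): 1, (False, False): 200}   # phi1(B, A)
phi2 = {(True, True): 50, (True, False): 10, (False, True): 1, (False, False): 200}   # phi2(E, A)
phi3 = {(True, True): 75, (True, False): 10, (False, True): 1, (False, False): 200}   # phi3(J, A)
phi4 = {(True, True): 50, (True, False): 1, (False, True): 10, (False, False): 200}   # phi4(M, A)

def unnormalized(b, e, a, j, m):
    return phi1[(b, a)] * phi2[(e, a)] * phi3[(j, a)] * phi4[(m, a)]

Z = sum(unnormalized(*x) for x in product([True, False], repeat=5))
p = unnormalized(False, False, False, False, False) / Z   # P(~B, ~E, ~A, ~J, ~M)
print(Z, p)
```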
Logistic Regression as a Markov Net
• Logistic regression is a simple Markov Net
[Figure: the same graph shape, Y connected to X1, X2, …, Xn, now as an undirected Markov net.]
• But only models the conditional distribution,
P(Y | X) and not the full joint P(X,Y)
Markov Nets vs. Bayes Nets
• Markov:
$$P(x_1, x_2, \ldots, x_n) = \frac{1}{Z}\prod_{k}\phi_k(x_{\{k\}}), \qquad Z = \sum_{x}\prod_{k}\phi_k(x_{\{k\}})$$
• Bayes:
$$P(x_1, x_2, \ldots, x_n) = \prod_{i=1}^{n} P(x_i \mid \mathrm{Parents}(X_i))$$
$$P(Y=y_i \mid X=x_k) = \frac{P(Y=y_i)\,P(X=x_k \mid Y=y_i)}{P(X=x_k)}, \qquad P(X=x_k) = \sum_{i=1}^{m} P(Y=y_i)\,P(X=x_k \mid Y=y_i)$$
Markov Nets vs. Bayes Nets
Property          Markov Nets          Bayes Nets
Form              Prod. potentials     Prod. potentials
Potentials        Arbitrary            Cond. probabilities
Cycles            Allowed              Forbidden
Partition func.   Z = ?                Z = 1
Normalization     Global               Local
Indep. check      Graph separation     D-separation
Indep. props.     Some                 Some
Inference         MCMC etc.            MCMC etc.
Generative vs. Discriminative
Sequence Labeling Models
• HMMs are generative models and are not directly
designed to maximize the performance of sequence
labeling. They model the joint distribution P(O,Q).
• HMMs are trained to have an accurate probabilistic
model of the underlying language, and not all
aspects of this model benefit the sequence labeling
task.
• Conditional Random Fields (CRFs) are
specifically designed and trained to maximize
performance of sequence labeling. They model the
conditional distribution P(Q | O).
Classification
[Figure: Naïve Bayes (generative) and logistic regression (conditional, discriminative) are drawn over the same graph of Y and X1, X2, …, Xn; they differ only in whether the joint P(X, Y) or the conditional P(Y | X) is modeled.]
Sequence Labeling
[Figure: analogously, an HMM (generative) and a linear-chain CRF (conditional, discriminative) share the same chain-structured graph over labels Y1, Y2, …, YT and observations X1, X2, …, XT.]
Simple Linear Chain CRF Features
• The modeling of the conditional distribution is similar to that used in multinomial logistic regression.
• Create feature functions f_k(Y_t, Y_{t−1}, X_t):
– A feature for each state-transition pair i, j:
• f_{i,j}(Y_t, Y_{t−1}, X_t) = 1 if Y_t = i and Y_{t−1} = j, and 0 otherwise
– A feature for each state-observation pair i, o:
• f_{i,o}(Y_t, Y_{t−1}, X_t) = 1 if Y_t = i and X_t = o, and 0 otherwise
• Note: the number of features grows quadratically in the number of states (i.e., tags).
Conditional Distribution for
Linear Chain CRF
• Using these feature functions for a simple
linear chain CRF, we can define:
$$P(Y \mid X) = \frac{1}{Z(X)}\exp\!\Big(\sum_{t=1}^{T}\sum_{k=1}^{K}\lambda_k f_k(Y_t, Y_{t-1}, X_t)\Big)$$
$$Z(X) = \sum_{Y}\exp\!\Big(\sum_{t=1}^{T}\sum_{k=1}^{K}\lambda_k f_k(Y_t, Y_{t-1}, X_t)\Big)$$
Adding Token Features to a CRF
• Can add token features Xi,j
[Figure: a linear chain over labels Y1, Y2, …, YT in which each label Yt is connected to its token features Xt,1, …, Xt,m.]
• Can add additional feature functions for each token feature to model the conditional distribution.
Features in POS Tagging
• For POS Tagging, use lexicographic
features of tokens.
– Capitalized?
– Starts with a numeral?
– Ends in given suffix (e.g. “s”, “ed”, “ly”)?
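A minimal Python sketch of lexicographic token features of this kind (the feature names are made up):

```python
# Boolean lexicographic features of a single token, as listed above.
def token_features(word):
    return {
        "capitalized": word[:1].isupper(),
        "starts_with_digit": word[:1].isdigit(),
        "suffix_s": word.endswith("s"),
        "suffix_ed": word.endswith("ed"),
        "suffix_ly": word.endswith("ly"),
    }

print(token_features("quickly"))
print(token_features("Dogs"))
```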
Enhanced Linear Chain CRF
(standard approach)
• Can also condition transition on the current
token features.
[Figure: the same chain over labels Y1, Y2, …, YT and token features X1,1, …, XT,m, but with each label-to-label transition also connected to token features.]
• Add feature functions:
– f_{i,j,k}(Y_t, Y_{t−1}, X) = 1 if Y_t = i and Y_{t−1} = j and X_{t−1,k} = 1, and 0 otherwise
Supervised Learning
(Parameter Estimation)
• As in logistic regression, use the L-BFGS optimization procedure to set the λ weights to maximize the CLL of the supervised training data.
• See paper for details.
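A hedged sketch of this kind of training using SciPy's L-BFGS-B optimizer on a toy logistic-regression CLL; a real CRF objective would sum feature functions over label sequences instead, and the data, lam, and all names here are illustrative:

```python
# Minimize the negative CLL (plus a small L2 penalty, cf. the regularization slide) with L-BFGS.
import math
import numpy as np
from scipy.optimize import minimize

data = [([1.0, 0.0], 1), ([0.9, 0.1], 1), ([0.1, 1.0], 0), ([0.2, 0.8], 0)]
lam = 0.1   # smoothing constant; keeps the optimum finite on this separable toy set

def neg_objective(params):
    w0, w = params[0], params[1:]
    cll = 0.0
    for x, y in data:
        s = w0 + float(np.dot(w, x))
        p1 = 1.0 / (1.0 + math.exp(s))      # the slides' parameterization of P(Y=1|X)
        cll += math.log(p1 if y == 1 else 1.0 - p1)
    penalty = 0.5 * lam * float(np.dot(w, w))
    return -(cll - penalty)

result = minimize(neg_objective, x0=np.zeros(3), method="L-BFGS-B")
print(result.x)   # fitted [w0, w1, w2]
```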
Sequence Tagging
(Inference)
• A variant of the Viterbi algorithm can be used to efficiently, in O(TN²) time, determine the globally most probable label sequence for a given token sequence using a given log-linear model of the conditional probability P(Y | X).
• See paper for details.
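A minimal Python sketch of Viterbi decoding over the toy transition and observation weights from the earlier CRF sketch; it runs in O(T·N²):

```python
# Dynamic programming: delta[y] is the best score of any label prefix ending in state y.
states = ["N", "V"]
trans = {("N", "N"): 0.8, ("N", "V"): 1.5, ("V", "N"): 1.0, ("V", "V"): -0.5}
emit = {("N", "dog"): 2.0, ("V", "dog"): -1.0, ("N", "runs"): -0.5, ("V", "runs"): 1.8}

def viterbi(x):
    delta = {y: emit[(y, x[0])] for y in states}
    back = []                                   # backpointers for each position t > 0
    for t in range(1, len(x)):
        new_delta, pointers = {}, {}
        for y in states:
            best_prev = max(states, key=lambda yp: delta[yp] + trans[(yp, y)])
            new_delta[y] = delta[best_prev] + trans[(best_prev, y)] + emit[(y, x[t])]
            pointers[y] = best_prev
        delta, back = new_delta, back + [pointers]
    # trace back the globally most probable label sequence
    y_last = max(states, key=lambda y: delta[y])
    path = [y_last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return list(reversed(path))

print(viterbi(["dog", "runs", "runs"]))
```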
Skip-Chain CRFs
• Can model some long-distance dependencies (i.e. the
same word appearing in different parts of the text) by
including long-distance edges in the Markov model.
[Figure: a linear-chain CRF over the sentence "Michael Dell said … Dell bought …"; an extra skip edge connects Y2 and Y100, the labels of the two occurrences of "Dell" (X2 and X100).]
• Additional links make exact inference intractable,
so must resort to approximate inference to try to
find the most probable labeling.
CRF Results
• Experimental results verify that CRFs have superior accuracy on various sequence labeling tasks:
– Part-of-speech tagging
– Noun phrase chunking
– Named entity recognition
– Semantic role labeling
• However, CRFs are much slower to train and do
not scale as well to large amounts of training data.
– Training for POS on full Penn Treebank (~1M words)
currently takes “over a week.”
• Skip-chain CRFs improve results on IE.
CRF Summary
• CRFs are a discriminative approach to sequence
labeling whereas HMMs are generative.
• Discriminative methods are usually more accurate
since they are trained for a specific performance task.
• CRFs also easily allow adding additional token features
without making additional independence assumptions.
• Training time is increased since a complex
optimization procedure is needed to fit supervised
training data.
• CRFs are a state-of-the-art method for sequence
labeling.