Transcript LARG-20010510

Grammar induction by Bayesian model averaging
Guy Lebanon
LARG meeting
May 2001
Based on Andreas Stolcke’s thesis, UC Berkeley, 1994
Why automatic grammar induction (AGI)
• Enables using domain-dependent grammars without expert intervention.
• Enables using person-dependent grammars without expert intervention.
• Can be used on different languages (without a linguist familiar with the particular language).
• A process of grammar induction with expert guidance may be more accurate than a human-written grammar, since computers are more adept than humans at analyzing large corpora.
Why statistical approaches to AGI
• In practice, languages are not logical structures.
• Sentences as actually spoken are often not precisely grammatical. Expanding the grammar to cover them leads to an explosion of grammar rules.
• A large grammar will lead to many parses of the same sentence. Clearly, some parses are more accurate than others. Statistical approaches make it possible to include a large set of grammar rules while assigning a probability to each parse.
• Statistics offers known optimality conditions and optimization procedures.
Some Bayesian statistics
• For each grammar M = (θ, s) (rule probabilities + rules), a prior probability p(M) is assigned. This value may represent experts’ opinion about how likely this grammar is.
• Upon introduction of a training corpus C (an unlabeled corpus), the model posterior is computed by Bayes’ law:
  p(M | C) = p(M) p(C | M) / p(C) ∝ p(M) p(C | M)
• Either the grammar that maximizes the posterior is kept (as the best grammar), or the set of all grammars and their posteriors is kept (better).
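As a toy illustration of this step (not from the talk), the sketch below scores a handful of hypothetical candidate grammars whose priors p(M) and likelihoods p(C | M) are simply made-up numbers, and shows both options: keeping the MAP grammar or keeping the whole posterior.

```python
import math

# Hypothetical candidate grammars with assumed log priors and log likelihoods;
# the numbers are invented purely for illustration.
candidates = {
    "G1": {"log_prior": math.log(0.5), "log_likelihood": -120.0},
    "G2": {"log_prior": math.log(0.3), "log_likelihood": -118.5},
    "G3": {"log_prior": math.log(0.2), "log_likelihood": -119.2},
}

# Unnormalized log posterior: log p(M) + log p(C | M).
log_post = {g: v["log_prior"] + v["log_likelihood"] for g, v in candidates.items()}

# Normalize with log-sum-exp to obtain p(M | C).
m = max(log_post.values())
log_z = m + math.log(sum(math.exp(lp - m) for lp in log_post.values()))
posterior = {g: math.exp(lp - log_z) for g, lp in log_post.items()}

map_grammar = max(posterior, key=posterior.get)   # option 1: keep the best grammar
print(map_grammar, posterior)                     # option 2: keep all posteriors
```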
Priors for CF grammars
• The prior of a grammar p(M) is split into two parts:
  p(M) = p(θ, s) = p(s) p(θ | s)
• The component p(s) is taken to introduce a bias towards short grammars (fewer rules). One way of doing that, though still heuristic, is minimum description length (MDL):
  p(s) ∝ e^(−l(s))
• The prior for the rule probabilities θ is taken to be a uniform Dirichlet prior, which has the effect of smoothing low rule-usage counts.
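As a small aside (mine, not from the talk), the smoothing effect of a uniform Dirichlet prior shows up in the posterior-mean rule probabilities; a minimal sketch follows, with invented counts.

```python
def smoothed_probs(counts, alpha=1.0):
    """Posterior-mean rule probabilities under a symmetric Dirichlet(alpha) prior:
    (c_i + alpha) / (sum(c) + K * alpha). Low counts are pulled towards uniform."""
    total = sum(counts) + alpha * len(counts)
    return [(c + alpha) / total for c in counts]

# Hypothetical usage counts for three rules of one non-terminal; the rule with
# zero observed uses still receives some probability mass.
print(smoothed_probs([10, 1, 0]))
```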
Grammar posterior
• Too hard to maximize over the posterior of both the rules
and the probabilities. Instead, the search is done to
maximize the posterior of the rules only:
  p(s | C) = p(s) p(C | s) / p(C)
           = (p(s) / p(C)) ∫ p(C, θ | s) dθ
           = (p(s) / p(C)) ∫ p(θ | s) p(C | θ, s) dθ
           = (p(s) / p(C)) ∫ p(θ | s) ∏_{x ∈ C} p(x | θ, s) dθ      (Dirichlet prior; sum over all derivations of x)
           ≈ (p(s) / p(C)) ∫ p(θ | s) ∏_{x ∈ C} p(V(x) | θ, s) dθ   (probability of the Viterbi derivation)
• Where V(x) is the Viterbi derivation of x. The last integral has a closed-form solution.
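A sketch of that closed form (my reconstruction, not code from the thesis): assuming p(θ | s) is a product of symmetric Dirichlet priors, one per non-terminal, and that the Viterbi derivations have already been reduced to rule-usage counts, the integral is a ratio of Dirichlet normalizing constants. The counts in the example are invented.

```python
from math import lgamma

def log_dirichlet_integral(counts_per_nonterminal, alpha=1.0):
    """Log of the integral over theta of p(theta | s) times the product of
    Viterbi-derivation probabilities, for a product-of-Dirichlets prior.

    counts_per_nonterminal: for each non-terminal, the list of usage counts
    of its rules, collected from the Viterbi derivations of the corpus.
    alpha: symmetric Dirichlet parameter (alpha = 1 gives the uniform prior).
    """
    total = 0.0
    for counts in counts_per_nonterminal:
        k = len(counts)
        # log B(alpha + counts) - log B(alpha), written with log-gamma terms.
        total += lgamma(k * alpha) - k * lgamma(alpha)
        total += sum(lgamma(alpha + c) for c in counts)
        total -= lgamma(k * alpha + sum(counts))
    return total

# Hypothetical counts: one non-terminal used its two rules 5 and 2 times,
# another used its three rules 3, 3 and 1 times.
print(log_dirichlet_integral([[5, 2], [3, 3, 1]]))
```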
Maximizing the posterior
• Even though computing an approximation to the posterior
is possible in closed form, coming up with a grammar that
maximizes it is still a hard problem.
• A. Stolcke: Start with many rules. Apply greedy operations
of merging rules to maximize the posterior.
• Model merging was applied to hidden Markov models, probabilistic context-free grammars, and probabilistic attribute grammars (PCFGs with semantic features tied to non-terminals).
A concrete example: PCFG
• A specific PCFG consists of a list of rules s and a set of production probabilities θ.
• For a given s, it is possible to learn the production
probabilities with EM. Coming up with an optimal s is still
an open problem. Stolcke’s model merging is an attempt to
tackle this problem.
• Given a corpus (set of sentences), an initial set of rules is
constructed:
for each sentence w_1 w_2 ... w_l in the corpus, add the rules
  S → X_1 X_2 ... X_l
  X_1 → w_1, ..., X_l → w_l
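A minimal sketch of this initialization (the (lhs, rhs) tuple representation is mine): sentences are assumed to be whitespace-tokenized strings, and giving every word occurrence its own non-terminal is a simplification of the scheme on the slide.

```python
def initial_rules(corpus):
    """Build the flat initial grammar: one S rule per sentence plus one
    lexical rule X_i -> w_i per word occurrence."""
    rules = []
    idx = 0
    for sentence in corpus:
        nts = []
        for word in sentence.split():
            nt = f"X{idx}"
            idx += 1
            nts.append(nt)
            rules.append((nt, (word,)))     # lexical rule X_i -> w_i
        rules.append(("S", tuple(nts)))     # S -> X_1 X_2 ... X_l
    return rules

print(initial_rules(["the dog barks", "a cat sleeps"]))
```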
Merging operators
• Non-terminal merging: replace two existing non-terminals
with a single new non-terminal.
  Z_1 → λ_1 X_1 μ_1,   Z_1 → λ_1 X_2 μ_1
  X_1 → γ_1,   X_2 → γ_2
  merge(X_1, X_2) = Y
  Z_1 → λ_1 Y μ_1,   Z_1 → λ_1 Y μ_1
  Y → γ_1,   Y → γ_2
• Non-terminal chunking: given an ordered sequence of non-terminals X_1 X_2 ... X_k, create a new non-terminal Y that expands to X_1 X_2 ... X_k, and replace occurrences of X_1 X_2 ... X_k on right-hand sides with Y.
  Z → λ X_1 X_2 ... X_k μ
  chunk(X_1, ..., X_k) = Y
  Z → λ Y μ
  Y → X_1 X_2 ... X_k
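Rough sketches of the two operators over the (lhs, rhs-tuple) rule representation from the initialization sketch above; the function names and the duplicate-removal detail are my own choices.

```python
def merge_nonterminals(rules, x1, x2, new_nt):
    """Replace every occurrence of non-terminals x1 and x2 by new_nt,
    then drop rules that have become exact duplicates."""
    renamed = []
    for lhs, rhs in rules:
        lhs = new_nt if lhs in (x1, x2) else lhs
        rhs = tuple(new_nt if sym in (x1, x2) else sym for sym in rhs)
        renamed.append((lhs, rhs))
    return list(dict.fromkeys(renamed))  # keep order, remove duplicates

def chunk_nonterminals(rules, seq, new_nt):
    """Introduce new_nt -> seq and replace the sequence on right-hand sides."""
    seq = tuple(seq)
    out = []
    for lhs, rhs in rules:
        new_rhs, i = [], 0
        while i < len(rhs):
            if tuple(rhs[i:i + len(seq)]) == seq:
                new_rhs.append(new_nt)
                i += len(seq)
            else:
                new_rhs.append(rhs[i])
                i += 1
        out.append((lhs, tuple(new_rhs)))
    out.append((new_nt, seq))
    return out
```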
PCFG priors
Prior for rule probabilities:
  p(θ | s) = ∏_i p(θ_i | s) ∝ ∏_i θ_i^(α_i − 1)
Prior for rules:
For a non-lexical rule (one that does not produce a terminal symbol) the description length is
  description_length(X → Y_1 Y_2 ... Y_k) = (k + 1) log(|N| + 1)
where N is the non-terminal alphabet. For a lexical rule (one that produces a terminal symbol) the description length is
  description_length(X → a) = log |Σ|
where Σ is the terminal alphabet.
The prior was taken to be either exponentially
decreasing or Poisson in the description length
  exponential:  p(s) ∝ e^(−l(s)) = ∏_{r ∈ s} e^(−l(r))
  Poisson:      p(s) ∝ e^(−μ) μ^(l(s)) / l(s)!
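A sketch of the description lengths and the two structure priors in log space, using the rule representation from the earlier sketches; the non-terminal and terminal alphabets are passed in explicitly, and the Poisson rate mu is a free parameter chosen arbitrarily here.

```python
from math import log, lgamma

def description_length(rule, nonterminals, terminals):
    """Description length of one rule, following the two cases on the slide."""
    _lhs, rhs = rule
    if len(rhs) == 1 and rhs[0] in terminals:            # lexical rule X -> a
        return log(len(terminals))
    return (len(rhs) + 1) * log(len(nonterminals) + 1)   # non-lexical rule

def log_prior_exponential(rules, nonterminals, terminals):
    # log of p(s) proportional to exp(-l(s)) = product over r in s of exp(-l(r))
    return -sum(description_length(r, nonterminals, terminals) for r in rules)

def log_prior_poisson(rules, nonterminals, terminals, mu=3.0):
    # log of p(s) proportional to exp(-mu) * mu**l(s) / l(s)!
    # (lgamma generalizes the factorial to non-integer lengths)
    l = sum(description_length(r, nonterminals, terminals) for r in rules)
    return -mu + l * log(mu) - lgamma(l + 1)
```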
Search strategy
• Start with the initial rules.
• Try applying all possible merge operations. For each resulting grammar compute the posterior and choose the merge which resulted in the highest posterior.
• Search strategies: best-first search, best-first search with look-ahead, beam search.
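A skeletal best-first loop (my sketch, not the thesis code), assuming a merge function that behaves like merge_nonterminals above and a log_posterior function that combines the structure prior with the Viterbi-approximated likelihood; look-ahead or beam search would replace the single-best inner choice.

```python
from itertools import combinations

def greedy_merge(rules, merge, log_posterior):
    """Best-first search: repeatedly apply the single non-terminal merge that
    most increases the (approximate) log posterior; stop when none improves it."""
    current, current_score = rules, log_posterior(rules)
    while True:
        nonterminals = sorted({lhs for lhs, _ in current if lhs != "S"})
        best, best_score = None, current_score
        for x1, x2 in combinations(nonterminals, 2):
            candidate = merge(current, x1, x2, f"M_{x1}_{x2}")
            score = log_posterior(candidate)
            if score > best_score:
                best, best_score = candidate, score
        if best is None:            # no merge improves the posterior
            return current
        current, current_score = best, best_score
```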
Now some examples…