Transcript
Chapter 14: Statistical Parsing
Heshaam Faili
[email protected]
University of Tehran
Motivation and Outline
Previously, we used CFGs to parse with, but:
Some ambiguous sentences could not be
disambiguated, and we would like to know the
most likely parse
How do we get such grammars? Do we write
them ourselves? Maybe we could use a corpus …
Where we’re going:
Probabilistic Context-Free Grammars (PCFGs)
Lexicalized PCFGs
Probabilistic Context-Free
Grammars (PCFGs)
We’ll have a CFG augmented with
probabilities
Used for disambiguation (attachment and coordination)
Used for language modeling, that is, estimating the probability of a sentence (S) in the language
We used N-grams for this before … but PCFGs are likely to do better
Probabilistic Context-Free
Grammars (PCFGs)
Definition of a CFG:
Set of non-terminals (N)
Set of terminals (T)
Set of rules/productions (P), of the form A → β
Designated start symbol (S)
Definition of a PCFG:
Same as a CFG, but with one more function, D
D assigns a probability to each rule in P
Probabilities
The function D gives probabilities for a non-terminal
A to be expanded to a sequence β.
The idea: given A as the mother non-terminal (LHS), what is the likelihood that β is the correct RHS?
Written as P(A → β)
or as P(A → β | A)
Note that Σi P(A → βi | A) = 1
For example, we would augment a CFG with these
probabilities:
P(S → NP VP | S) = .80
P(S → Aux NP VP | S) = .15
P(S → VP | S) = .05
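As a small illustration (not from the slides), the function D can be stored as a table keyed by the LHS non-terminal; the grammar fragment below reuses only the S expansions above, and the check verifies that the probabilities for each LHS sum to 1. This is a minimal sketch, and the dictionary layout is an assumption of the sketch.

    # Minimal sketch of storing PCFG rule probabilities D(A -> beta).
    # The rules shown are only the S expansions from the slide.
    from math import isclose

    pcfg = {
        "S": {("NP", "VP"): 0.80,
              ("Aux", "NP", "VP"): 0.15,
              ("VP",): 0.05},
    }

    def check_normalized(grammar):
        # For every LHS A, the probabilities P(A -> beta_i | A) must sum to 1.
        for lhs, expansions in grammar.items():
            assert isclose(sum(expansions.values()), 1.0), lhs

    check_normalized(pcfg)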
An Example
Using Probabilities to Parse
P(T): Probability of a particular
parse tree
P(T, S) = P(T) · P(S | T)
Since the parse tree T includes all the words of the sentence S, P(S | T) = 1
So P(T, S) = P(T) = Πn∈T p(r(n))
i.e., the product of the probabilities of all the rules r used to expand each node n in the parse tree
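As an illustration (not from the slides), P(T) can be computed by walking the tree and multiplying the probability of the rule used at each node. The nested-list tree encoding and the (LHS, RHS) lookup table below, which also holds the lexical rules A → w, are assumptions of this sketch.

    # Sketch: P(T) = product over nodes n of p(r(n)).
    # A tree is ["LABEL", child1, child2, ...]; a leaf word is a plain string.
    # rule_probs maps (LHS, RHS-tuple) to a probability, lexical rules included.
    def tree_probability(tree, rule_probs):
        if isinstance(tree, str):               # a word contributes no rule
            return 1.0
        label, *children = tree
        rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
        p = rule_probs[(label, rhs)]            # p(r(n)) for this node
        for child in children:
            p *= tree_probability(child, rule_probs)
        return p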
Disambiguated Parse Tree
Using probabilities
So, the probability for that parse is 0.000015. What’s
the big deal?
Whereas we couldn’t decide between two parses
using a regular CFG, we now can.
For example, TWA flights is ambiguous between being
two separate NPs (cf. I gave [NP John] [NP money]) or
one NP:
Probabilities are useful for comparing with other probabilities
A: [book [TWA] [flights]]
B: [book [TWA flights]]
Probabilities allow us to choose parse B
Language Modeling
Bigram estimation: P(w1 w2 … wn) ≈ Πi P(wi | wi−1)
Language modeling and N-grams
N-gram problem (e.g., predicting “after”):
the contract ended with a loss of 7 cents after trading as low as 9 cents
An N-gram model must predict “after” from the few immediately preceding words, even though the syntactically relevant word (“ended”) is far away
Use the parse tree to predict words based on syntactically useful information, independent of how far away the relevant words are
Obtaining the best parse
Call the best parse T(S), where S is
your sentence
Get the tree which has the highest
probability, i.e.
T(S) = argmax T∈parse-trees(S) P(T)
Can use the Cocke-Younger-Kasami
(CYK) algorithm to calculate best parse
CYK is a form of dynamic programming
CYK is a chart parser, like the Earley parser
The Probabilistic CYK
algorithm
Base case
Add the words to the chart
Store P(A → wi) for every category A in the chart
Recursive case (this is what makes it dynamic programming: each B and C is calculated only once)
Rules must be of the form A → B C, i.e., exactly two non-terminals on the RHS (this is Chomsky Normal Form (CNF))
Get the probability for A at this node by multiplying the probabilities for B and for C by P(A → B C):
P(B) * P(C) * P(A → B C)
For a given A, only keep the maximum probability (again, this is dynamic programming)
PCYK pseudo-code
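The pseudo-code figure from this slide is not reproduced in the transcript. Below is a minimal Python sketch of probabilistic CYK under the assumptions just described (CNF grammar); the dictionary formats lexical_rules and binary_rules are illustrative choices of this sketch, not the book's notation.

    # Minimal probabilistic CYK sketch (assumes a CNF PCFG).
    # Grammar format (illustrative):
    #   lexical_rules[word]  -> list of (A, P(A -> word))
    #   binary_rules[(B, C)] -> list of (A, P(A -> B C))
    from collections import defaultdict

    def pcyk(words, lexical_rules, binary_rules, start="S"):
        n = len(words)
        # table[(i, j)] maps a non-terminal A to the best probability of
        # deriving words[i:j] from A; back stores backpointers.
        table = defaultdict(dict)
        back = defaultdict(dict)

        # Base case: fill the diagonal with P(A -> w_i).
        for i, w in enumerate(words):
            for A, p in lexical_rules.get(w, []):
                table[(i, i + 1)][A] = p
                back[(i, i + 1)][A] = w

        # Recursive case: combine adjacent spans with rules A -> B C,
        # keeping only the maximum probability per non-terminal A.
        for span in range(2, n + 1):
            for i in range(0, n - span + 1):
                j = i + span
                for k in range(i + 1, j):
                    for B, pB in table[(i, k)].items():
                        for C, pC in table[(k, j)].items():
                            for A, pRule in binary_rules.get((B, C), []):
                                p = pB * pC * pRule
                                if p > table[(i, j)].get(A, 0.0):
                                    table[(i, j)][A] = p
                                    back[(i, j)][A] = (k, B, C)

        return table[(0, n)].get(start, 0.0), back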
Example: The flight includes a meal
Learning Probabilities using a
Treebank
Given a corpus of sentences with syntactic annotation (e.g., the Penn Treebank)
Consider all the parse trees
(1) Each time a rule of the form A → β is applied in a parse tree, increment a counter for that rule
(2) Also count the number of times A appears on the left-hand side of any rule
Then estimate P(A → β | A) = Count(A → β) / Count(A)
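A minimal sketch of this maximum-likelihood estimation over treebank trees, using the same nested-list tree encoding assumed in the earlier sketches (["LABEL", child1, …] with leaf words as plain strings):

    # Count rule and LHS occurrences over treebank trees, then estimate
    # P(A -> beta | A) = Count(A -> beta) / Count(A).
    from collections import Counter

    def count_rules(tree, rule_counts, lhs_counts):
        if isinstance(tree, str):                    # leaf word: no rule here
            return
        label, *children = tree
        rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
        rule_counts[(label, rhs)] += 1
        lhs_counts[label] += 1
        for child in children:
            count_rules(child, rule_counts, lhs_counts)

    def estimate_pcfg(treebank):
        rule_counts, lhs_counts = Counter(), Counter()
        for tree in treebank:
            count_rules(tree, rule_counts, lhs_counts)
        return {(lhs, rhs): c / lhs_counts[lhs]
                for (lhs, rhs), c in rule_counts.items()}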
If we don't have annotated data, but do have a non-probabilistic CFG, then …
Learning Probabilities from a
non-probabilistic CFG
Parse the corpus using the CFG, but …
Most sentences are ambiguous (have multiple parses)
Use all the parses and weight them, then count rule uses according to the weights of the parses. But how can we estimate the weights? (chicken and egg)
Incrementally improve the estimates: start the parser with equal rule probabilities and iteratively recompute the weighted rule counts
Inside-outside algorithm (a special case of the EM algorithm, like forward-backward)
E-step: Expectation step
M-step: Maximization step
Problems with PCFGs
(poor independence assumptions)
It's still only a CFG, so dependencies on non-CFG information are not captured
English example: pronouns are more likely to be subjects than objects
Persian example: pro-drop is much more likely when the dropped argument is the subject
P[(NP → Pronoun) | NP = subject] >> P[(NP → Pronoun) | NP = object]
Problems with PCFGs (Ignore
lexical information)
Ignores lexical information (statistics), which
is usually crucial for disambiguation
(T1) America sent [[250,000 soldiers] [into Iraq]]
(T2) America sent [250,000 soldiers] [into Iraq]
In the Penn Treebank (PTB), send with an into-PP is always attached high, as in (T2)!
To handle lexical information, we’ll turn to
lexicalized PCFGs
Ignore lexical information
VP → VBD NP PP
VP → VBD NP
NP → NP PP
Ignore lexical information
Based only on the rule probabilities, i.e. P(VP → VBD NP PP) versus P(VP → VBD NP) and P(NP → NP PP), the same attachment (VP or NP) is chosen for every PP, independent of the lexical items
NP attachment is more frequent overall, as in “fishermen caught tons of herring”, but …
The affinity between “dumped” and “into” is greater than that between “sacks” and “into” => VP attachment
But the affinity between “tons” and “of” is greater than that between “caught” and “of” => NP attachment
⇒ We need to model lexical dependencies
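As a toy illustration (not from the slides), lexical affinity for PP attachment can be approximated from word/preposition co-occurrence counts; the counts below are made up, and the decision rule is a deliberately simplified comparison rather than a real lexicalized model.

    # Toy lexical-affinity sketch for PP attachment (made-up counts).
    # AFFINITY_COUNTS[(x, prep)] ~ how often preposition prep attaches to x.
    AFFINITY_COUNTS = {
        ("dumped", "into"): 50, ("sacks", "into"): 3,   # illustrative numbers
        ("caught", "of"):   2,  ("tons", "of"):   80,
    }

    def attach(verb, noun, prep):
        """Choose VP vs NP attachment by comparing lexical affinities."""
        verb_affinity = AFFINITY_COUNTS.get((verb, prep), 0)
        noun_affinity = AFFINITY_COUNTS.get((noun, prep), 0)
        return "VP" if verb_affinity > noun_affinity else "NP"

    print(attach("dumped", "sacks", "into"))   # -> VP
    print(attach("caught", "tons", "of"))      # -> NP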
Handling Coordination
ambiguity
Dealing with PCFG problems:
SPLITTING AND MERGING
To handle the fact that NPs in subject position tend to be pronouns: split NP into NPsubject and NPobject
Parent annotation: split each non-terminal based on its parent non-terminal
NP^VP : object
NP^S : subject
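A minimal sketch of parent annotation (not from the slides), again assuming the nested-list tree encoding used in the earlier sketches:

    def parent_annotate(tree, parent=None):
        """Rewrite each non-terminal label as LABEL^PARENT,
        e.g. an NP under S becomes NP^S."""
        if isinstance(tree, str):              # a word: leave it unchanged
            return tree
        label, *children = tree
        new_label = f"{label}^{parent}" if parent else label
        return [new_label] + [parent_annotate(c, label) for c in children]

    t = ["S", ["NP", ["PRP", "he"]], ["VP", ["VBD", "slept"]]]
    print(parent_annotate(t))
    # -> ['S', ['NP^S', ['PRP^NP', 'he']], ['VP^S', ['VBD^VP', 'slept']]]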
Parent annotation
Splitting the pre-terminals
e.g., “if” prefers sentential complements to NP complements
Split and Merge
Node-splitting is not without problems; it
increases the size of the grammar, and hence
reduces the amount of training data available
for each grammar rule, leading to overfitting
Choosing the correct granularity is the key
Split and merge iteratively
Start with a simple grammar
Alternately split and merge non-terminals, finding the set of annotated nodes that maximizes the likelihood of the training treebank
Probabilistic Lexicalized PCFGs
Remember how we talked about head information
being passed up in a syntactic analysis?
Well, if you follow this down all the way to the
bottom of a tree, you wind up with a head word
In some sense, we can say that Book that flight is
not just an S, but an S rooted in book
Thus, book is the headword of the whole sentence
By adding headword information to nonterminals, we
wind up with a lexicalized grammar
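As an illustration of head percolation (not from the slides), here is a minimal sketch that finds the head word of a constituent by walking down head children; the tiny HEAD_TABLE is a made-up stand-in for the much richer head-finding rules used by real lexicalized parsers.

    # Minimal head-percolation sketch (illustrative head table only).
    HEAD_TABLE = {
        "S":  ["VP", "S"],          # the head child of an S is its VP
        "VP": ["VB", "VBD", "VP"],  # the head child of a VP is its verb
        "NP": ["NN", "NNS", "NP"],
    }

    def head_word(tree):
        """tree is ["LABEL", child1, ...]; a pre-terminal is ["POS", "word"]."""
        label, *children = tree
        if len(children) == 1 and isinstance(children[0], str):
            return children[0]                 # pre-terminal: the word itself
        for cand in HEAD_TABLE.get(label, []):
            for child in children:
                if child[0] == cand:
                    return head_word(child)    # percolate the head upward
        return head_word(children[-1])         # fallback: rightmost child

    t = ["S", ["VP", ["VB", "Book"], ["NP", ["DT", "that"], ["NN", "flight"]]]]
    print(head_word(t))   # -> 'Book': the S is "rooted in" book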
Probabilistic Lexicalized PCFGs
We make a further extension and also associate the head tag (the POS tag of the head word) with each non-terminal
Incorporating Head
Probabilities: Wrong Way
Simply adding the headword w to each node won't work:
The probabilities are too small, i.e., we don't have a big enough corpus to estimate them
The node A becomes A[w]
e.g., P(A[w] → β | A[w]) = Count(A[w] → β) / Count(A[w])
VP(dumped) → VBD(dumped) NP(sacks) PP(into)  3×10⁻¹⁰
VP(dumped) → VBD(dumped) NP(cats) PP(into)  8×10⁻¹¹
These probabilities are tiny, and others will never
occur
Lexicalized Grammars
Best results so far:
Collins Parser
Charniak Parser
Incorporating head probabilities:
Right way (Collins Parser)
Treat the right-hand side of every (internal) CFG rule as consisting of a head non-terminal, together with the non-terminals to the left of the head and the non-terminals to the right of the head
Collins Parser
Instead of a simple MLE probability, break down each rule with a generative story:
Given the left-hand side, we first generate the head of the rule, and then generate the dependents of the head, one by one, from the inside out
Add STOP symbols at the beginning and end of the RHS of each rule
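A minimal sketch of this decomposition (simplified: the probability tables P_H, P_L, P_R are assumed to have been estimated elsewhere, and the conditioning contexts are reduced for readability; this is not the full Collins model):

    # Simplified Collins-style rule probability (illustrative only):
    # P(RHS | LHS) ~= P_H(H | LHS) * prod_i P_L(L_i | LHS, H)
    #                              * prod_j P_R(R_j | LHS, H)
    # where the left/right dependent sequences are terminated by STOP.
    STOP = "STOP"

    def rule_probability(lhs, head, left_deps, right_deps, P_H, P_L, P_R):
        """left_deps/right_deps are listed from the head outward and must
        end with STOP; P_H, P_L, P_R are dicts of estimated probabilities."""
        p = P_H[(head, lhs)]                   # generate the head child
        for dep in left_deps:                  # generate left dependents,
            p *= P_L[(dep, lhs, head)]         # inside out, up to STOP
        for dep in right_deps:                 # generate right dependents
            p *= P_R[(dep, lhs, head)]
        return p

    # e.g. VP(dumped) -> VBD(dumped) NP(sacks) PP(into):
    # rule_probability("VP(dumped)", "VBD(dumped)",
    #                  [STOP], ["NP(sacks)", "PP(into)", STOP], P_H, P_L, P_R)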
Generative Story
Collins Parser
Advanced Features of the Collins Parser
Distance feature
Evaluating Parser Output
Dependency relations are also useful for comparing parser output to a treebank
Traditional (PARSEVAL) measures of parser accuracy: labeled precision and labeled recall of constituents, plus crossing brackets
Crossing brackets: the parse has a bracketing such as ((A B) C) where the treebank has (A (B C))
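A minimal PARSEVAL-style sketch (not from the slides), assuming constituents are represented as (label, start, end) spans; the example spans are made up:

    # Labeled precision / recall / F1 over constituent spans.
    def precision_recall(gold_spans, parsed_spans):
        gold, parsed = set(gold_spans), set(parsed_spans)
        correct = len(gold & parsed)
        precision = correct / len(parsed) if parsed else 0.0
        recall = correct / len(gold) if gold else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        return precision, recall, f1

    gold = [("S", 0, 5), ("NP", 0, 2), ("VP", 2, 5), ("NP", 3, 5)]
    parsed = [("S", 0, 5), ("NP", 0, 2), ("VP", 2, 5), ("PP", 3, 5)]
    print(precision_recall(gold, parsed))   # (0.75, 0.75, 0.75)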
DISCRIMINATIVE RERANKING
Generative parsers: the probabilistic model gives us the probability of generating a particular sentence by assigning a probability to each choice the parser makes
But generative parsing models make it hard to incorporate arbitrary kinds of information into the probability
e.g., a general preference for right-branching structures in English
Two-stage discriminative reranking:
N-best list: output the N best parses instead of a single parse
Extract features: the parse probability assigned by the first-stage statistical parser, each of the CFG rules in the tree, the number of parallel conjuncts, how heavy each constituent is, measures of how right-branching the parse tree is, how many times various tree fragments occur, bigrams of adjacent non-terminals in the tree
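A minimal sketch of the second stage (not the actual reranker): score each candidate parse with a linear model over its features, keeping the first-stage score as one feature; feature_fn and weights are assumptions of this sketch.

    # Rerank an N-best list with a linear model (illustrative only).
    def rerank(nbest, feature_fn, weights):
        """nbest: list of (tree, first_stage_logprob); returns the best tree.
        feature_fn(tree) returns a dict of feature name -> value."""
        def score(tree, logprob):
            feats = feature_fn(tree)
            feats["first_stage_logprob"] = logprob   # keep stage-1 score as a feature
            return sum(weights.get(name, 0.0) * value
                       for name, value in feats.items())
        return max(nbest, key=lambda tl: score(*tl))[0]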
PARSER-BASED LANGUAGE
MODELING
Use two-stage modeling:
First stage: use a normal N-gram language model and output the N-best word sequences according to the N-gram model
Second stage: run our statistical parser and assign a parse probability to each sentence in the N-best list or lattice
Practice
14.5, 14.6