
Chapter 14: Statistical Parsing
Heshaam Faili
[email protected]
University of Tehran
Motivation and Outline

Previously, we used CFGs to parse with, but:
- Some ambiguous sentences could not be disambiguated, and we would like to know the most likely parse
- How do we get such grammars? Do we write them ourselves? Maybe we could use a corpus …

Where we're going:
- Probabilistic Context-Free Grammars (PCFGs)
- Lexicalized PCFGs
Probabilistic Context-Free Grammars (PCFGs)

- A CFG augmented with probabilities
- Used for disambiguation (attachment and coordination)
- Used for language modeling, i.e., estimating the probability of a sentence S in the language
  - We used N-grams for this before, but PCFGs are likely to do better
Probabilistic Context-Free Grammars (PCFGs)

Definition of a CFG:
- Set of non-terminals (N)
- Set of terminals (T)
- Set of rules/productions (P), of the form A → β
- Designated start symbol (S)

Definition of a PCFG:
- Same as a CFG, but with one more function, D
- D assigns probabilities to each rule in P
Probabilities

- The function D gives the probability for a non-terminal A to be expanded to a sequence β
  - The idea is that, given A as the mother non-terminal (LHS), what is the likelihood that β is the correct RHS?
  - Written as P(A → β), or as P(A → β | A)
  - Note that Σi P(A → βi | A) = 1
- For example, we would augment a CFG with these probabilities:
  - P(S → NP VP | S) = .80
  - P(S → Aux NP VP | S) = .15
  - P(S → VP | S) = .05
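For concreteness, a PCFG can be stored as a table mapping each LHS non-terminal to its possible expansions and their probabilities. The Python sketch below uses the three S rules above plus a couple of NP rules whose probabilities are invented purely for illustration, and checks the sum-to-1 constraint:

# Toy PCFG: each LHS maps to a list of (RHS, probability) pairs.
# The NP rules and their numbers are invented for this example.
pcfg = {
    "S":  [(("NP", "VP"), 0.80),
           (("Aux", "NP", "VP"), 0.15),
           (("VP",), 0.05)],
    "NP": [(("Det", "Nominal"), 0.60),
           (("Pronoun",), 0.40)],
}

# The probabilities of all expansions of the same LHS must sum to 1.
for lhs, expansions in pcfg.items():
    total = sum(p for _, p in expansions)
    assert abs(total - 1.0) < 1e-9, f"rules for {lhs} sum to {total}"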
An Example

[Slide figure: an example PCFG.]
Using Probabilities to Parse

- P(T): the probability of a particular parse tree
- P(T,S) = P(T) · P(S|T), but P(S|T) = 1 (the tree T already contains the words of S), so
- P(T,S) = P(T) = Πn∈T p(r(n))
  - i.e., the product of the probabilities of all the rules r used to expand each node n in the parse tree
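As a sketch of this computation, the Python below multiplies rule probabilities over the nodes of a tree represented as nested tuples; the rule table and its numbers are invented for the example, and lexical rules not listed in the table are simply given probability 1:

# Toy rule probabilities (made up for illustration).
rule_prob = {
    ("S", ("NP", "VP")): 0.80,
    ("NP", ("Pronoun",)): 0.40,
    ("VP", ("Verb", "NP")): 0.50,
    ("NP", ("Det", "Noun")): 0.60,
}

def tree_prob(tree):
    """P(T): product of rule probabilities over all nodes of the tree."""
    label, *children = tree
    if len(children) == 1 and isinstance(children[0], str):
        # Pre-terminal over a word: use its lexical probability if listed.
        return rule_prob.get((label, children[0]), 1.0)
    rhs = tuple(child[0] for child in children)
    p = rule_prob[(label, rhs)]
    for child in children:
        p *= tree_prob(child)
    return p

t = ("S", ("NP", ("Pronoun", "I")),
          ("VP", ("Verb", "book"), ("NP", ("Det", "the"), ("Noun", "flight"))))
print(tree_prob(t))   # 0.80 * 0.40 * 0.50 * 0.60 = 0.096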
Disambiguated Parse Tree
Using Probabilities

- So, the probability for that parse is 0.000015. What's the big deal?
  - Probabilities are useful for comparing against other probabilities
  - Whereas we couldn't decide between two parses using a regular CFG, we now can
- For example, TWA flights is ambiguous between being two separate NPs (cf. I gave [NP John] [NP money]) or one NP:
  - A: [book [TWA] [flights]]
  - B: [book [TWA flights]]
  - The probabilities allow us to choose parse B
Language Modeling
Bigram estimation
Language Modeling and N-grams

- N-gram problem (predicting "after" below):
  - the contract ended with a loss of 7 cents after trading as low as 9 cents
  - the immediately preceding words ("7 cents") are poor predictors; the distant verb "ended" is far more informative
- Use the parse tree to predict words based on this more useful information, rather than only on the adjacent words
Obtaining the Best Parse

- Call the best parse T(S), where S is your sentence
  - Get the tree which has the highest probability, i.e.
  - T(S) = argmaxT∈parse-trees(S) P(T)
- Can use the Cocke-Younger-Kasami (CYK) algorithm to calculate the best parse
  - CYK is a form of dynamic programming
  - CYK is a chart parser, like the Earley parser
The Probabilistic CYK Algorithm

- Base case:
  - Add words to the chart
  - Store P(A → wi) for every category A in the chart
- Recursive case (this is what makes it dynamic programming, because we only calculate B and C once):
  - Rules must be of the form A → B C, i.e., exactly two items on the RHS (this is Chomsky Normal Form (CNF))
  - Get the probability for A at this node by multiplying the probabilities for B and for C by P(A → B C):
    P(B) * P(C) * P(A → B C)
  - For a given A, keep only the maximum probability (again, this is dynamic programming)
PCYK Pseudo-code

[Slide figure: pseudo-code for the probabilistic CYK algorithm.]
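A minimal Python sketch of the same recurrence, assuming a CNF grammar stored in two tables whose shapes are assumptions of this sketch; backpointers for recovering the best tree are omitted for brevity:

def pcyk(words, binary_rules, lexical_rules):
    """binary_rules: {(B, C): [(A, prob), ...]}; lexical_rules: {w: [(A, prob), ...]}."""
    n = len(words)
    # table[i][j] maps a non-terminal to the best probability of deriving words[i:j]
    table = [[dict() for _ in range(n + 1)] for _ in range(n + 1)]

    # Base case: fill the diagonal with lexical rules A -> w_i
    for i, w in enumerate(words):
        for A, p in lexical_rules.get(w, []):
            table[i][i + 1][A] = max(p, table[i][i + 1].get(A, 0.0))

    # Recursive case: combine B over [i,k) and C over [k,j) with A -> B C,
    # keeping only the maximum probability for each A (dynamic programming).
    for span in range(2, n + 1):
        for i in range(0, n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for B, pB in table[i][k].items():
                    for C, pC in table[k][j].items():
                        for A, pRule in binary_rules.get((B, C), []):
                            p = pB * pC * pRule
                            if p > table[i][j].get(A, 0.0):
                                table[i][j][A] = p

    return table   # table[0][n].get("S", 0.0) is the probability of the best parse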
Example: The flight includes a meal
Learning Probabilities Using a Treebank

- Given a corpus of sentences annotated with syntactic structure (e.g., the Penn Treebank):
  - Consider all the parse trees
  - (1) Each time a rule of the form A → β is applied in a parse tree, increment a counter for that rule
  - (2) Also count the number of times A is on the left-hand side of a rule
  - Then estimate P(A → β | A) = Count(A → β) / Count(A)  (see the sketch below)
- If you don't have annotated data but do have a non-probabilistic CFG, then …
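A minimal sketch of this supervised estimate, counting rules over treebank trees represented here as nested tuples (that tree format, and the absence of any smoothing, are assumptions of the sketch):

from collections import Counter

def estimate_pcfg(trees):
    """MLE rule probabilities: P(A -> beta | A) = Count(A -> beta) / Count(A)."""
    rule_count, lhs_count = Counter(), Counter()

    def visit(tree):
        label, *children = tree
        if len(children) == 1 and isinstance(children[0], str):
            rhs = (children[0],)                      # lexical rule A -> w
        else:
            rhs = tuple(child[0] for child in children)
            for child in children:
                visit(child)
        rule_count[(label, rhs)] += 1                 # step (1)
        lhs_count[label] += 1                         # step (2)

    for t in trees:
        visit(t)
    return {rule: count / lhs_count[rule[0]] for rule, count in rule_count.items()}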
Learning Probabilities from a Non-probabilistic CFG

- Parse the corpus using the CFG, but …
- Most sentences are ambiguous (have multiple parses)
- Use all the parses and weight them, then count rule uses according to the weights of the parses. But how can we estimate the weights? (chicken and egg)
- Incrementally improve our estimates: start the parser with equal rule probabilities and iteratively compute the weighted rule counts
  - Inside-outside algorithm (a special case of the EM algorithm, like forward-backward)
  - E-step: Expectation step
  - M-step: Maximization step
Problems with PCFGs (Poor Independence Assumptions)

- It's still only a CFG, so dependencies on non-CFG information are not captured
  - English example: pronouns are more likely to be subjects than objects:
    P[(NP → Pronoun) | NP = subj]  >>  P[(NP → Pronoun) | NP = obj]
  - Persian example: pro-drop is very likely when the dropped argument is the subject
Problems with PCFGs (Ignoring Lexical Information)

- PCFGs ignore lexical information (statistics), which is usually crucial for disambiguation
  - (T1) America sent [[250,000 soldiers] [into Iraq]]
  - (T2) America sent [250,000 soldiers] [into Iraq]
  - send with an into-PP is always attached high (T2) in the PTB!
- To handle lexical information, we'll turn to lexicalized PCFGs
Ignoring Lexical Information

VP → VBD NP PP
VP → VBD NP
NP → NP PP
Ignoring Lexical Information

- Based only on the probabilities of the rules (VP → VBD NP PP) versus (VP → VBD NP plus NP → NP PP), one of the two attachments (VP or NP) is chosen for every PP, independent of the lexical items
  - NP attachment is the more likely choice overall, but …
  - compare "workers dumped sacks into a bin" with "fishermen caught tons of herring"
  - The affinity between "dumped" and "into" is greater than that between "sacks" and "into" => VP attachment
  - But the affinity between "tons" and "of" is greater than that between "caught" and "of" => NP attachment
  - So we need to model lexical dependencies
Handling Coordination Ambiguity

[Slide figure: a coordination-ambiguity example.]
Dealing with PCFG Problems: Splitting and Merging

- To handle the fact that NPs in subject position tend to be pronouns: split NP into NPsubject and NPobject
- Parent annotation: split each non-terminal based on its parent non-terminal (see the sketch below)
  - NP^VP : object
  - NP^S : subject
Parent annotation
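The figure shows a parent-annotated tree; a minimal sketch of the transformation itself, on trees represented as nested tuples (that format, and leaving pre-terminals unsplit, are assumptions of the sketch):

def parent_annotate(tree, parent=None):
    """Append the parent's label to every non-terminal, e.g. NP under S becomes NP^S."""
    label, *children = tree
    if len(children) == 1 and isinstance(children[0], str):
        return (label, children[0])                   # leave pre-terminals unsplit
    new_label = f"{label}^{parent}" if parent else label
    return (new_label, *[parent_annotate(child, label) for child in children])

t = ("S", ("NP", ("PRP", "I")),
          ("VP", ("VBD", "need"), ("NP", ("DT", "a"), ("NN", "flight"))))
print(parent_annotate(t))
# ('S', ('NP^S', ('PRP', 'I')), ('VP^S', ('VBD', 'need'), ('NP^VP', ('DT', 'a'), ('NN', 'flight'))))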
Splitting the Pre-terminals

- Example: "if" prefers sentential complements to NP complements
Split and Merge

- Node-splitting is not without problems; it increases the size of the grammar and hence reduces the amount of training data available for each grammar rule, leading to overfitting
- Choosing the correct granularity is the key
- Split and merge iteratively:
  - Start with a simple grammar
  - Alternately split and merge non-terminals, finding the set of annotated nodes that maximizes the likelihood of the training-set treebank
Probabilistic Lexicalized PCFGs

- Remember how we talked about head information being passed up in a syntactic analysis?
- Well, if you follow this down all the way to the bottom of a tree, you wind up with a head word
- In some sense, we can say that Book that flight is not just an S, but an S rooted in book
- Thus, book is the headword of the whole sentence
- By adding headword information to non-terminals, we wind up with a lexicalized grammar
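A minimal sketch of how that head word percolates up a tree; the tiny head-rule table below is an assumption (real parsers use much larger head-percolation tables), and trees are again nested tuples:

HEAD_RULES = {"S": ["VP"], "VP": ["VBD", "VB"], "NP": ["NN", "NNS"], "PP": ["IN"]}

def head_word(tree):
    """Return the head word of the tree by following head children downward."""
    label, *children = tree
    if len(children) == 1 and isinstance(children[0], str):
        return children[0]                            # pre-terminal: the word itself
    child_labels = [child[0] for child in children]
    for candidate in HEAD_RULES.get(label, []):       # pick the head child by rule
        if candidate in child_labels:
            return head_word(children[child_labels.index(candidate)])
    return head_word(children[0])                     # fallback: leftmost child

t = ("S", ("VP", ("VB", "Book"), ("NP", ("DT", "that"), ("NN", "flight"))))
print(head_word(t))   # 'Book' -- the whole sentence is "an S rooted in book"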
Probabilistic Lexicalized PCFGs

- A further extension: also associate the head tag (the POS tag of the head word) with each non-terminal
Incorporating Head Probabilities: Wrong Way

- Simply adding the headword w to each node won't work:
  - The node A becomes A[w]
  - e.g., P(A[w] → β | A[w]) = Count(A[w] → β) / Count(A[w])
- The probabilities are too small, i.e., we don't have a big enough corpus to estimate them:
  - VP(dumped) → VBD(dumped) NP(sacks) PP(into)   3x10^-10
  - VP(dumped) → VBD(dumped) NP(cats) PP(into)    8x10^-11
  - These probabilities are tiny, and others will never occur
Lexicalized Grammars

- Best results so far:
  - Collins parser
  - Charniak parser
Incorporating Head Probabilities: Right Way (Collins Parser)

- Treat the right-hand side of every (internal) CFG rule as consisting of a head non-terminal, together with the non-terminals to the left of the head and the non-terminals to the right of the head
Collins Parser

- Instead of a single MLE probability for the whole rule, break the rule down with a generative story:
  - Given the left-hand side, first generate the head of the rule, and then generate the dependents of the head, one by one, from the inside out
  - Add STOP symbols at the beginning and end of the RHS of the rule
Generative Story

[Slide figure: the generative story for a lexicalized rule.]
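Roughly, for the example rule VP(dumped) → VBD(dumped) NP(sacks) PP(into), this generative story decomposes the rule probability as follows (PH generates the head, PL the left dependents, PR the right dependents, each side ending in STOP):

P(VP(dumped) → VBD(dumped) NP(sacks) PP(into))
  ≈ PH(VBD | VP, dumped)
  × PL(STOP | VP, VBD, dumped)
  × PR(NP(sacks) | VP, VBD, dumped)
  × PR(PP(into) | VP, VBD, dumped)
  × PR(STOP | VP, VBD, dumped)

Each of these smaller conditional probabilities can be estimated from far less sparse counts than the full lexicalized rule.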
Collins Parser
Advanced Features of the Collins Parser

- Distance feature: condition the dependent probabilities on the distance between the head and the dependent being generated (e.g., whether they are adjacent, and whether a verb intervenes)
Evaluating Parser Output

- Dependency relations are also useful for comparing parser output to a treebank
- Traditional (PARSEVAL) measures of parser accuracy:
  - Labeled recall and labeled precision over constituents (and their F-measure)
  - Crossing brackets: ((A B) C) vs. (A (B C))
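A minimal sketch of the labeled precision/recall computation, comparing the (label, start, end) constituents of a hypothesis tree against a reference tree; the nested-tuple tree format and the decision to skip pre-terminals are assumptions of the sketch, and edge cases (e.g., no matches) are ignored:

def constituents(tree, start=0):
    """Return the set of (label, start, end) constituents and the span length."""
    label, *children = tree
    if len(children) == 1 and isinstance(children[0], str):
        return set(), 1                               # pre-terminal over one word
    spans, position = set(), start
    for child in children:
        child_spans, length = constituents(child, position)
        spans |= child_spans
        position += length
    spans.add((label, start, position))
    return spans, position - start

def labeled_precision_recall(hyp_tree, ref_tree):
    hyp, _ = constituents(hyp_tree)
    ref, _ = constituents(ref_tree)
    correct = len(hyp & ref)
    precision, recall = correct / len(hyp), correct / len(ref)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1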
Discriminative Reranking

- Generative parsers: the probabilistic model gives us the probability of generating a particular sentence by assigning a probability to each choice the parser makes
- But generative parsing models also make it hard to incorporate arbitrary kinds of information into the probability
  - e.g., the overall preference for right-branching structures in English
- Two-stage discriminative reranking:
  - N-best list: output the N best parses instead of one single parse
  - Extract features: the parse probability assigned by the first-stage statistical parser, each of the CFG rules in the tree, the number of parallel conjuncts, how heavy each constituent is, measures of how right-branching the parse tree is, how many times various tree fragments occur, bigrams of adjacent non-terminals in the tree, …
  - A second-stage discriminative classifier then scores each candidate with these features and picks the best parse from the N-best list
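A minimal sketch of that second stage, assuming (purely for illustration) a linear scoring model over named features; the feature names and the way candidates are packaged are assumptions of the sketch:

def rerank(nbest, weights):
    """nbest: list of (parse, features) pairs, where features maps name -> value.
    Return the parse whose weighted feature score is highest."""
    def score(features):
        return sum(weights.get(name, 0.0) * value for name, value in features.items())
    return max(nbest, key=lambda candidate: score(candidate[1]))[0]

# Example usage with invented features and weights:
candidates = [
    ("(S (NP ...) (VP ...))",  {"log_parse_prob": -12.3, "right_branching": 0.8}),
    ("(S (NP ...) (VP ... ))", {"log_parse_prob": -12.9, "right_branching": 0.9}),
]
weights = {"log_parse_prob": 1.0, "right_branching": 0.5}
print(rerank(candidates, weights))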
Parser-based Language Modeling

- Use two-stage modeling:
  - First stage: use a normal N-gram grammar to produce the N-best output (an N-best list or lattice) under the N-gram model
  - Second stage: run our statistical parser and assign a parse probability to each sentence in the N-best list or lattice
Practice

- Exercises 14.5 and 14.6