Transcript Parsing

Parsing
See:
R Garside, G Leech & A McEnery (eds), Corpus Annotation, London (1997): Longman, chapters 11 (Bateman et al) and 12 (Garside & Rayson)
G Kennedy, An Introduction to Corpus Linguistics, London (1998): Longman, pp. 231-244
CF Meyer, English Corpus Linguistics, Cambridge (2002): CUP, pp. 91-96
R Mitkov (ed), The Oxford Handbook of Computational Linguistics, Oxford (2003): OUP, chapter 4 (Kaplan)
J Allen, Natural Language Understanding (2nd ed) (1994): Addison Wesley
Parsing
• POS tags give information about the individual words, and their internal form (eg sing vs plur, tense of verb)
• Additional level of information concerns the way the words relate to each other
  – the overall structure of each sentence
  – the relationships between the words
• This can be achieved by parsing the corpus
Parsing – overview
• What sort of information does parsing add?
• What are the difficulties relating to parsing?
• How is parsing done?
• Parsing and corpora
  – partial parsing, chunking
  – stochastic parsing
  – treebanks
Structural information
• Parsing adds information about sentence structure and constituents
• Allows us to see what constructions words enter into
  – eg transitivity, passivization, argument structure for verbs
• Allows us to see how words function relative to each other
  – eg what words can modify / be modified by other words
[S[N Nemo_NP1 ,_, [N the_AT killer_NN1 whale_NN1 N] ,_, [Fr[N who_PNQS N][V 'd_VHD
grown_VVN [J too_RG big_JJ [P for_IF [N his_APP$ pool_NN1 [P on_II [N Clacton_NP1
Pier_NNL1 N]P]N]P]J]V]Fr]N] ,_, [V has_VHZ arrived_VVN safely_RR [P at_II [N his_APP$
new_JJ home_NN1 [P in_II [N Windsor_NP1 [ safari_NN1 park_NNL1 ]N]P]N]P]V] ._. S]
[The labelled bracketing above, drawn as a constituent tree over the tagged words.]
[The same tree, with the main verb highlighted. Caption: given this verb, what kinds of things can be subject?]
[The same tree, with the adjective complement highlighted. Caption: verb with adjective complement: what verbs can participate in this construction? with what adjectives? any other constraints?]
[The same tree, with the PP complement highlighted. Caption: verb with PP complement: what verbs with what prepositions? any constraints on the noun?]
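Questions like these can be asked mechanically once a corpus is parsed. Below is a minimal sketch, assuming Python with NLTK and its freely downloadable 10% sample of the Penn Treebank (not the Lancaster-style corpus shown above): it collects which verbs occur with which prepositions inside a VP.

```python
from collections import Counter

import nltk
from nltk.corpus import treebank

nltk.download("treebank", quiet=True)  # fetch the sample if not present

verb_prep = Counter()
for tree in treebank.parsed_sents():
    # every VP whose first child is a verb and which also contains a PP
    for vp in tree.subtrees(lambda t: t.label() == "VP"):
        if not (isinstance(vp[0], nltk.Tree) and vp[0].label().startswith("VB")):
            continue
        verb = vp[0].leaves()[0].lower()
        for child in vp[1:]:
            if isinstance(child, nltk.Tree) and child.label().startswith("PP"):
                head = child[0]  # the PP's preposition, usually IN or TO
                if isinstance(head, nltk.Tree) and head.label() in ("IN", "TO"):
                    verb_prep[(verb, head.leaves()[0].lower())] += 1

for (verb, prep), n in verb_prep.most_common(10):
    print(verb, prep, n)
```

Note that a VP-internal PP is not always a complement; separating complements from adjuncts would need the treebank's function tags or manual filtering.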
Parsing: difficulties
• Besides lexical ambiguities (usually resolved by the tagger), language can be structurally ambiguous
  – global ambiguities due to ambiguous words and/or alternative possible combinations
  – local ambiguities, especially due to attachment ambiguities, and other combinatorial possibilities
  – sheer weight of alternatives available in the absence of (much) knowledge
Global ambiguities
• Individual words can be ambiguous as to category
• In combination with each other this can lead to ambiguity (see the sketch below):
  – Time flies like an arrow
  – Gas pump prices rose last time oil stocks fell
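A minimal sketch of how such category ambiguity plays out in a parser, assuming Python with NLTK and an invented toy grammar in which flies can be noun or verb and like can be verb or preposition:

```python
import nltk

grammar = nltk.CFG.fromstring("""
S  -> NP VP
NP -> N | N N | Det N
VP -> V PP | V NP
PP -> P NP
Det -> 'an'
N -> 'time' | 'flies' | 'arrow'
V -> 'flies' | 'like'
P -> 'like'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("time flies like an arrow".split()):
    tree.pretty_print()
# one parse: [time] [flies [like an arrow]]   (flies = verb)
# the other: [time flies] [like [an arrow]]   (flies = noun, like = verb)
```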
Local ambiguities
• Structure of individual constituents may be given, but how they fit together can be in doubt
• Classic example of PP attachment (sketched below)
  – The man saw the girl with the telescope
  – The man saw the girl in the park with a statue of the general on a horse with a sword on a stand with a red dress with a telescope in the morning
• Many other attachments potentially ambiguous
  – relative clauses, adverbs, parentheticals, etc
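The attachment ambiguity can be reproduced with the same toy-grammar approach (again assuming Python with NLTK; the grammar is invented). Because NP and VP are both recursive, each additional PP multiplies the number of parses, which is why the long example above explodes:

```python
import nltk

grammar = nltk.CFG.fromstring("""
S  -> NP VP
NP -> Det N | NP PP
VP -> V NP | VP PP
PP -> P NP
Det -> 'the'
N  -> 'man' | 'girl' | 'telescope'
V  -> 'saw'
P  -> 'with'
""")

parser = nltk.ChartParser(grammar)
sent = "the man saw the girl with the telescope".split()
for tree in parser.parse(sent):   # two parses: NP- vs VP-attachment
    tree.pretty_print()
```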
Difficulties
• Broad coverage is necessary for parsing corpora of real text
• Long sentences:
  – structures are very complex
  – ambiguities proliferate
• It can be difficult (even for a human) to verify whether a parse is correct
  – because it is complex
  – because it may be genuinely ambiguous
How to parse
• Traditionally (in linguistics)
  – hand-written grammar
  – usually narrow coverage
  – linguists are interested in theoretical issues regarding syntax
• Even in computational linguistics
  – interest is (was?) in parsing algorithms
• In either case, grammars typically used a small set of categories (N, V, Adj etc)
Lack of knowledge
• Humans are very good at disambiguating
• In fact they rarely even notice the ambiguity
• Usually, only one reading “makes sense”
• They use a combination of
  – linguistic knowledge
  – common-sense (real-world) knowledge
  – contextual knowledge
• Only the first is available to computers, and then only in a limited way
Parsing corpora
• Using a tagger as a front-end changes things:
  – richer set of grammatical categories which reflect some morphological information
  – hand-written grammars are more difficult, though, because many generalisations are lost (eg we now need many more rules for NP; see the sketch below)
  – disambiguation done by the tagger in some sense pre-empts work that you might have expected the parser to do
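A minimal sketch of the lost-generalisation point, assuming Python with NLTK and an invented fragment of CLAWS-like tags: where a grammar over bare categories needs one rule NP → det (adj) n, a grammar over the rich tagset needs a rule per tag combination.

```python
import nltk

# toy fragment: each tag combination gets its own NP rule
grammar = nltk.CFG.fromstring("""
NP -> AT NN1 | AT NN2 | AT JJ NN1 | AT JJ NN2 | JJ NN2 | NN2
AT  -> 'the'
JJ  -> 'new'
NN1 -> 'home'
NN2 -> 'homes'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("the new homes".split()):
    print(tree)   # (NP (AT the) (JJ new) (NN2 homes))
```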
Parsing corpora
• Impact of the broad-coverage requirement
  – broad coverage means that many more constructions are covered by the grammar
  – this increases ambiguity massively
• Partial parsing may be sufficient for some needs
• Availability of corpora permits (and encourages) a stochastic approach
Partial parsing
• Identification of constituents (noun phrases, verb groups, PPs) is often quite robust …
• Only fitting them together can be difficult
• Although some information is lost, identifying “chunks” can be useful (see the sketch below)
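A minimal sketch of chunking, assuming Python with NLTK's regular-expression chunker and its default English tagger (Penn tags, not the CLAWS tags shown earlier):

```python
import nltk

# resource name may differ across NLTK versions (assumption: a recent NLTK)
nltk.download("averaged_perceptron_tagger", quiet=True)

sentence = "Nemo the killer whale has arrived safely at his new home"
tagged = nltk.pos_tag(sentence.split())

# one chunk rule: optional determiner or possessive, adjectives, then nouns
chunker = nltk.RegexpParser(r"NP: {<DT|PRP\$>?<JJ>*<NN.*>+}")
print(chunker.parse(tagged))
```

The chunker finds the three noun-phrase chunks without committing to how they fit into the sentence as a whole.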
Stochastic parsing
• Like ordinary parsing, but competing rules are assigned a probability score
• Scores can be used to compare (and favour) alternative parses (see the sketch after the rules)
• Where do the probabilities come from?

S  → NP VP        .80
S  → aux NP VP    .15
S  → VP           .05
NP → det n        .20
NP → det adj n    .35
NP → n            .20
NP → adj n        .15
NP → pro          .10
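A minimal sketch of the rules above in code, assuming Python with NLTK and an invented toy lexicon; the Viterbi parser returns the most probable parse together with its score, the product of the probabilities of the rules used:

```python
import nltk

# the rules from the slide, plus a VP rule and a toy lexicon;
# each left-hand side's probabilities must sum to 1
pcfg = nltk.PCFG.fromstring("""
S  -> NP VP       [0.80]
S  -> Aux NP VP   [0.15]
S  -> VP          [0.05]
NP -> Det N       [0.20]
NP -> Det Adj N   [0.35]
NP -> N           [0.20]
NP -> Adj N       [0.15]
NP -> Pro         [0.10]
VP -> V NP        [1.00]
Aux -> 'does'     [1.0]
Det -> 'the'      [1.0]
Adj -> 'killer'   [1.0]
Pro -> 'he'       [1.0]
V  -> 'likes'     [1.0]
N -> 'whale' [0.5] | 'pools' [0.5]
""")

parser = nltk.ViterbiParser(pcfg)   # returns the single most probable parse
for tree in parser.parse("the killer whale likes pools".split()):
    print(tree.prob())              # product of the rule probabilities used
    print(tree)
```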
Where do the probabilities come from?
1) Use a corpus of already-parsed sentences: a “treebank”
  – Best-known example is the Penn Treebank
    • Marcus et al. 1993
    • Available from the Linguistic Data Consortium
    • Based on the Brown corpus + 1m words of Wall Street Journal + the Switchboard corpus
  – Count the occurrences of each rule variant (eg each NP expansion) and divide by the total count of NP applications (see the sketch below)
  – Very laborious, so of course it is done automatically
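A minimal sketch of that count, assuming Python with NLTK's Penn Treebank sample: each tree node contributes one rule application, and each rule's probability is its count divided by the total count for its left-hand side (nltk.induce_pcfg packages the same computation).

```python
from collections import Counter

import nltk
from nltk.corpus import treebank

nltk.download("treebank", quiet=True)

rule_count = Counter()
lhs_count = Counter()
for tree in treebank.parsed_sents():
    for prod in tree.productions():   # one rule application per tree node
        rule_count[prod] += 1
        lhs_count[prod.lhs()] += 1

# P(rule) = count(rule) / count(all rules with the same left-hand side)
for prod, n in rule_count.most_common(5):
    print(prod, round(n / lhs_count[prod.lhs()], 3))

# equivalently, in one call:
pcfg = nltk.induce_pcfg(nltk.Nonterminal("S"),
                        [p for t in treebank.parsed_sents() for p in t.productions()])
```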
Where do the probabilities come from?
2) Create your own treebank from your own corpus
  – Easy if all sentences are unambiguous: just count the (successful) rule applications
  – When there are ambiguities, rules which contribute to the ambiguity have to be counted separately and weighted
Where do the probabilities come from?
3) Learn them as you go along
  – Again, this assumes some way of identifying the correct parse in case of ambiguity
  – Each time a rule is successfully used, its probability is adjusted (see the sketch below)
  – You have to start with some estimated probabilities, eg all equal
  – It does need human intervention, otherwise rules become self-fulfilling prophecies
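A minimal sketch of that adjustment, assuming Python; confirmed_productions is a hypothetical stand-in for whatever identifies the correct parse (the human judge just mentioned), and starting every count at one makes all rules for a category equally probable at the outset:

```python
from collections import Counter

counts = Counter()

def observe(confirmed_productions):
    """Called once per sentence with the rules of the confirmed parse."""
    for prod in confirmed_productions:
        counts[prod] += 1

def probability(rule, rules_with_same_lhs):
    # add-one start: every rule begins with the same non-zero probability
    total = sum(counts[r] + 1 for r in rules_with_same_lhs)
    return (counts[rule] + 1) / total
```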
Bootstrapping the grammar
• Start with a basic grammar, possibly written by hand, with all rules equally probable
• Parse a small amount of text, then correct it manually
  – this may involve correcting the trees and/or changing the grammar
• Learn new probabilities from this small treebank
• Parse another (similar) amount of text, then correct it manually
• Adjust the probabilities based on the old and new trees combined
• Repeat until the grammar stabilizes
Treebanks – some examples
• Penn Treebank, perhaps the best known
  – Wall Street Journal corpus, Brown corpus; >1m words
• International Corpus of English (ICE)
• Lancaster Parsed Corpus and Lancaster-Leeds treebank
  – parsed excerpts from LOB; 140k and 45k words respectively
• Susanne corpus, Christine corpus, Lucy corpus
  – related to the Lancaster corpora; developed by Geoffrey Sampson
• Verbmobil treebanks
  – parallel treebanks (Eng, Ger, Jap) used in a speech MT project
• LinGO Redwoods: HPSG-based parsing of Verbmobil data
• Multi-Treebank
  – parses of 60 sentences in various frameworks
• The PARC 700 Dependency Bank
  – LFG parses of 700 sentences also found in the Penn Treebank
• CHILDES
  – Brown Eve corpus of children’s speech samples with dependency annotation