ppt - CSE, IIT Bombay

Download Report

Transcript ppt - CSE, IIT Bombay

CS626-460: Language
Technology for the Web/Natural
Language Processing
Pushpak Bhattacharyya
CSE Dept.,
IIT Bombay
Constituent Parsing and Algorithms
(with major contributions from Dr. Rajat
Mohanty)
Syntax


Syntax is the study of the combination of
words into phrases, clauses and sentences.
Syntax describes how sentences and their
constituents are structured.
Grammar

A finite set of rules
 that generates only and all sentences
of a language.
 that assigns an appropriate structural
description to each one.
Grammatical Analysis
Techniques

Two main devices
Breaking up a String



Sequential
Hierarchical
Transformational
Labeling the Constituents



Morphological
Categorial
Functional
Hierarchical Breaking up and
Categorial Labeling

Poor John ran Saway.
NP
A
Poor
VP
N
John
V
Adv
ran
away
Hierarchical Breaking up and
Functional Labeling


Immediate Constituent (IC) Analysis
Construction types in terms of the function
of the constituents:





Predication (subject + predicate)
Modification (modifier + head)
Complementation (verbal + complement)
Subordination (subordinator + dependent
unit)
Coordination (independent unit + coordinator)
An Example

In the morning, the sky looked much brighter.
S
Modifier
Subordinator
Modifier
Head
Subject
DU
Head
Modifier
Predicate
Head
Verbal
Complement
Modifier
In
the
morning,
the
sky
looked
much
Head
brighter
Noun Phrases

John

NP
the student

the intelligent
student
NP
NP
N
Det
N
John
the
student
Det
AdjP
N
the intelligent student
Phrases
Noun Phrase

his first five PhD students
NP
Det
Ord
Quant
N
his
first
five
PhD
N
students
Noun Phrase

The five best students of my class
NP
Det
Quant
the
five
AP
N
best students
PP
of my class
Verb Phrases

can sing

can hit the ball
VP
VP
Aux
V
Aux
V
NP
can
sing
can
hit
the ball
Verb Phrase

Can give a flower to Mary
VP
Aux
can
V
NP
give a flower
PP
to Mary
Verb Phrase

may make John the chairman
VP
Aux
may
V
NP
make John
NP
the chairman
Verb Phrase

may find the book very interesting
VP
Aux
V
NP
may
find
the book
AP
very interesting
Prepositional Phrases

in the classroom

near the
river
PP
PP
P
NP
P
NP
in
the classroom
near
the river
Adjective Phrases

intelligent

very
honest

fond of
sweets
AP
AP
A
Degree
intelligent
very
AP
A
A
honest fond
PP
of sweets
Adjective Phrase
• very worried that she might have done badly in the
assignment
AP
Degree
very
S’
A
worried
that she might have done badly in the
assignment
A segment of English
Grammar






S’(C) S
S{NP/S’} VP
VP(AP+) V (AP+) ({NP/S’}) (AP+)
(PP+) (AP+)
NP(D) (AP+) N (PP+)
PPP NP
AP(AP) A
PSG Parse Tree

John wrote those words in the Book of Proverbs.
S
NP
PropN
VP
V
PP
NP
NP
P
John wrote
NP
those
words
PP
in
the
book
of
proverbs
Penn Treebank

John wrote those words in the Book of
Proverbs.
(S (NP-SBJ (NP John))
(VP wrote
(NP those words)
(PP-LOC in
(NP (NP-TTL (NP the Book)
(PP of
(NP Proverbs)))
PSG Parse Tree

Official trading in the shares will start in
Paris on Nov 6.
S
NP
VP
NP
PP
Aux
AP
N
P
V
PP
PP
NP
A
official trading in
the shares will
start in Paris
on Nov 6
Penn POS Tags

Official trading in the shares will start in
Paris on Nov 6.
[ Official/JJ trading/NN ]
in/IN
[ the/DT shares/NNS ]
will/MD start/VB in/IN
[ Paris/NNP ]
on/IN
[ Nov./NNP 6/CD ]
Penn Treebank

Official trading in the shares will start in
Paris on Nov 6.
( (S (NP-SBJ (NP Official trading)
(PP in
(NP the shares)))
(VP will
(VP start
(PP-LOC in
(NP Paris))
(PP-TMP on
(NP (NP Nov 6)
Penn POS Tag Sset
















Adjective:
Adverb:
Cardinal Number:
Determiner:
Preposition:
Coordinating Conjunction
Subordinating Conjunction:
Singular Noun:
Plural Noun:
Personal Pronoun:
Proper Noun:
Verb base form:
Modal verb:
Verb (3sg Pres):
Wh-determiner:
Wh-pronoun:
JJ
RB
CD
DT
IN
CC
IN
NN
NNS
PP
NP
VB
MD
VBZ
WDT
WP
Basic Parsing Strategy
A Fragment of English
Grammar
S
 NP VP
VP  V NP
NP  NNP | ART N
NNP  Ram
V
 ate | saw
ART  a | an | the
N
 rice | apple |
movie
Derivation
• S is a special symbol called start symbol.
Multiple
Choice
Points
S => NP VP
=>
=>
=>
=>
=>
=>
=>
(rewrite S)
NNP VP
(rewrite NP)
Ram VP
(rewrite NNP)
Ram V NP
(rewrite VP)
Ram ate NP (rewrite V)
Ram ate ART N
(rewrite NP)
Ram ate the N
(rewrite ART)
Ram ate the rice
(rewrite N)
Two Strategies : Top-Down &
Bottom-Up



Top down : Start with S and generate
the sentence.
Bottom up : Start with the words in
the sentence and use the rewrite
rules backwards to reduce the
sequence of symbols to produce S.
Previous slide showed top-down
strategy.
Bottom-Up Derivation
=>
=>
=>
=>
=>
=>
=>
=>
Ram ate the rice
NNP ate the rice
(rewrite
NNP V the rice
(rewrite
NNP V ART rice
(rewrite
NNP V ART N (rewrite rice)
NP V ART N (rewrite NNP)
NP V NP
(rewrite
NP VP
(rewrite V NP)
S
Ram)
ate)
the)
ART N)
Parsing Algorithm
A procedure that
“searches” through the grammatical rules
to find a combination that
generates a tree which
stands for the structure of the sentence
Top-Down Parsing (using A*)

DFS on the AND-OR graph

Data structures:

Open List (OL): Nodes to be expanded
Closed List (CL): Expanded Nodes
Input List (IL): Words of sentence to be

Moving Head (MH): Walks over the IL


parsed
Trace of Top-Down Parsing
Initial Condition (T0)
OL
S
CL (empty)
Ram ate the rice
MH
IL
Trace of Top-Down Parsing
T1:
NP
S
Ram ate the rice
MH
OL
VP
CL
IL
Trace of Top-Down Parsing
T2:
OL
NNP ART N VP
S NP
Ram ate the rice
MH
CL
IL
Trace of Top-Down Parsing
T3:
OL
ART N VP
S NP NNP
Ram ate the rice
CL
IL
MH (portion of Input consumed)
Trace of Top-Down Parsing
T4:
N
OL
VP
S NP NNP ART*
Ram ate the rice
MH
(* indicates ‘useless’ expansion)
CL
IL
Trace of Top-Down Parsing
T5:
OL
VP
S NP NNP ART* N*
Ram ate the rice
MH
CL
IL
Trace of Top-Down Parsing
T6:
OL
V NP
S NP NNP ART* N*
Ram ate the rice
MH
CL
IL
Trace of Top-Down Parsing
T7:
OL
NP
S NP NNP ART* N* V
Ram ate the rice
MH
CL
IL
Trace of Top-Down Parsing
T8:
OL
NNP ART N
S NP NNP ART* N* V NP
Ram ate the rice
MH
CL
IL
Trace of Top-Down Parsing
T9:
ART
S NP NNP ART* N* V NNP*
Ram ate the rice
MH
OL
N
CL
IL
Trace of Top-Down Parsing
T10:
OL
N
S NP NNP ART* N* V NNP ART
Ram ate the rice
MH
CL
IL
Trace of Top-Down Parsing
T11:
OL
S NP NNP ART* N* V NNP ART N
Ram ate the rice
MH
CL
IL
Successful Termination: OL empty AND MH at the
end of IL.
Bottom-Up Parsing
Basic idea:
 Refer to words from the
lexicon.
 Obtain all POSs for each
word.
 Keep combining until S is
obtained. (to be continued)