ppt - CSE, IIT Bombay

Download Report

Transcript ppt - CSE, IIT Bombay

CS626-460: Language
Technology for the Web/Natural
Language Processing
Pushpak Bhattacharyya
CSE Dept.,
IIT Bombay
Introducing Syntax, Phrase Structure,
Basics of Parsing Algorithms
Introduction to Syntax
Syntax


Syntax is the study of the combination of
words into phrases, clauses and sentences.
Syntax describes how sentences and their
constituents are structured.
Where does it fit in the
scheme of things
Semantics
Syntax
Lexicon
Does syntax help?

“opens the key easily door the”




Vs
“flying planes can be dangerous”
“flying planes are dangerous”
“flying planes is dangerous”
Grammar

A finite set of rules
 that generates only and all sentences
of a language.
 that assigns an appropriate structural
description to each one.
Criteria of a “good” grammar



Observational Adequacy
Descriptive Adequacy
Explanatory Adequacy
Grammatical Analysis
Techniques

Two main devices
Breaking up a String



Sequential
Hierarchical
Transformational
Labeling the Constituents



Morphological
Categorial
Functional
Breaking up and Labeling


Sequential Breaking up
 Sequential Breaking up and Morphological
Labeling
 Sequential Breaking up and Categorial Labeling
 Sequential Breaking up and Functional Labeling
Hierarchical Breaking up
 Hierarchical Breaking up and Categorial Labeling
 Hierarchical Breaking up and Functional Labeling
Sequential Breaking up

That student solved the problems.
that + student + solve + ed + the + problem + s
Sequential Breaking up and Morphological
Labeling

That student solved the problems.
that student solve ed
word
word
stem
the
affix word
problem
s
stem
affix
Sequential Breaking up and
Categorial Labeling


This boy can solve the problem.
this
boy
can
solve
the
problem
Det
N
Aux
V
Det
N
They called her a taxi.
They
call
ed
her
a
taxi
Pron
V
Affix
Pron
Det
N
Sequential Breaking up and
Functional Labeling
They
called
her
Subject
Verbal
Direct
Object
They
called
her
Subject
Verbal
Indirect
Object
a
taxi
Indirect
Object
a
taxi
Direct
Object
Hierarchical Breaking up
Old
men and women
Old
Old men and women
men and women
men
and
women
Old men
Old
men
and
women
Hierarchical Breaking up and
Categorial Labeling

Poor John ran Saway.
NP
A
Poor
VP
N
John
V
Adv
ran
away
Hierarchical Breaking up and
Functional Labeling


Immediate Constituent (IC) Analysis
Construction types in terms of the function
of the constituents:





Predication (subject + predicate)
Modification (modifier + head)
Complementation (verbal + complement)
Subordination (subordinator + dependent
unit)
Coordination (independent unit + coordinator)
Predication

[Birds]subject [fly]predicate
S
Subject
Birds
Predicate
fly
Modification


[A]modifier [flower]head
John [slept]head [in the room]modifier
S
Subject
John
Predicate
Head
slept
Modifier
In the room
Complementation

He
[saw]verbal
S
Subject
He
[a lake]complement
Predicate
Verbal
Complement
saw
a lake
Subordination

John slept [in]subordinator [the room]dependent unit
S
Subject
John
Predicate
Head
slept
Modifier
Subordinator
in
Dependent Unit
the room
Coordination

[John came in time]
ready] independent unit
independent unit
[but]coordinator
[Mary was not
S
Independent Unit
Coordinator
Independent Unit
John came in time
but
Mary was not ready
An Example

In the morning, the sky looked much brighter.
S
Modifier
Subordinator
Modifier
Head
Subject
DU
Head
Modifier
Predicate
Head
Verbal
Complement
Modifier
In
the
morning,
the
sky
looked
much
Head
brighter
Hierarchical Breaking up and
Categorial / Functional Labeling



Hierarchical Breaking up coupled with Categorial
/Functional Labeling is a very powerful device.
But there are ambiguities which demand something
more powerful.
E.g., Love of God
 Someone loves God
 God loves someone
Hierarchical Breaking up
Categorial Labeling
Functional Labeling
Love of God
Noun
Phrase
love
Love of God
Prepositional
Phrase
of
God
Head
love
Modifier
Sub
DU
of
God
Types of Generative Grammar



Finite State Model
(sequential)
Phrase Structure Model
(sequential + hierarchical) + (categorial)
Transformational Model
(sequential + hierarchical + transformational) +
(categorial + functional)
Phrase Structure Grammar (PSG)
A phrase-structure grammar G consists of a four tuple
(V, T, S, P), where




V is a finite set of alphabets (or vocabulary)
 E.g., N, V, A, Adv, P, NP, VP, AP, AdvP, PP, student,
sing, etc.
T is a finite set of terminal symbols: T  V
 E.g., student, sing, etc.
S is a distinguished non-terminal symbol, also called
start symbol: S  V
P is a set of productions.
Noun Phrases

John

NP
the student

the intelligent
student
NP
NP
N
Det
N
John
the
student
Det
AdjP
N
the intelligent student
Noun Phrase

his first five PhD students
NP
Det
Ord
Quant
N
his
first
five
PhD
N
students
Noun Phrase

The five best students of my class
NP
Det
Quant
the
five
AP
N
best students
PP
of my class
Verb Phrases

can sing

can hit the ball
VP
VP
Aux
V
Aux
V
NP
can
sing
can
hit
the ball
Verb Phrase

Can give a flower to Mary
VP
Aux
can
V
NP
give a flower
PP
to Mary
Verb Phrase

may make John the chairman
VP
Aux
may
V
NP
make John
NP
the chairman
Verb Phrase

may find the book very interesting
VP
Aux
V
NP
may
find
the book
AP
very interesting
Prepositional Phrases

in the classroom

near the
river
PP
PP
P
NP
P
NP
in
the classroom
near
the river
Adjective Phrases

intelligent

very
honest

fond of
sweets
AP
AP
A
Degree
intelligent
very
AP
A
A
honest fond
PP
of sweets
Adjective Phrase
• very worried that she might have done badly in the
assignment
AP
Degree
very
S’
A
worried
that she might have done badly in the
assignment
Phrase Structure Rules

The boy hit the ball.


Rewrite
(i)
(ii)
(iii)
Rules:
S

NP VP
NP

Det N
VP

V NP
(iv)
Det

the
(v)
N

man, ball
(v)
V

hit
We interpret each rule X  Y as the instruction
rewrite X as Y.
Derivation


The boy hit the ball.
Sentence
NP + VP
Det + N + VP
Det + N + V + NP
The + N + V + NP
The + boy + V + NP
The + boy + hit + NP
The + boy + hit + Det + N
The + boy + hit + the + N
The + boy + hit + the + ball
(i)
(ii)
(iii)
(iv)
(v)
(vi)
(ii)
(iv)
(v)
PSG Parse Tree

The boy hit the ball.
S
NP
VP
Det
N
the
boy
V
hit
NP
Det
N
the
ball
PSG Parse Tree

John wrote those words in the Book of Proverbs.
S
NP
PropN
VP
V
PP
NP
NP
P
John wrote
NP
those
words
PP
in
the
book
of
proverbs
Penn POS Tags

John wrote those words in the Book of
Proverbs.
[John/NNP ]
wrote/VBD
[ those/DT words/NNS ]
in/IN
[ the/DT Book/NN ]
of/IN
[ Proverbs/NNS ]
Penn Treebank

John wrote those words in the Book of
Proverbs.
(S (NP-SBJ (NP John))
(VP wrote
(NP those words)
(PP-LOC in
(NP (NP-TTL (NP the Book)
(PP of
(NP Proverbs)))
PSG Parse Tree

Official trading in the shares will start in
Paris on Nov 6.
S
NP
VP
NP
PP
Aux
AP
N
P
V
PP
PP
NP
A
official trading in
the shares will
start in Paris
on Nov 6
Penn POS Tags

Official trading in the shares will start in
Paris on Nov 6.
[ Official/JJ trading/NN ]
in/IN
[ the/DT shares/NNS ]
will/MD start/VB in/IN
[ Paris/NNP ]
on/IN
[ Nov./NNP 6/CD ]
Penn Treebank

Official trading in the shares will start in
Paris on Nov 6.
( (S (NP-SBJ (NP Official trading)
(PP in
(NP the shares)))
(VP will
(VP start
(PP-LOC in
(NP Paris))
(PP-TMP on
(NP (NP Nov 6)
Penn POS Tag Sset
















Adjective:
Adverb:
Cardinal Number:
Determiner:
Preposition:
Coordinating Conjunction
Subordinating Conjunction:
Singular Noun:
Plural Noun:
Personal Pronoun:
Proper Noun:
Verb base form:
Modal verb:
Verb (3sg Pres):
Wh-determiner:
Wh-pronoun:
JJ
RB
CD
DT
IN
CC
IN
NN
NNS
PP
NP
VB
MD
VBZ
WDT
WP
Basic Parsing Strategy
A Fragment of English
Grammar
S
 NP VP
VP  V NP
NP  NNP | ART N
NNP  Ram
V
 ate | saw
ART  a | an | the
N
 rice | apple |
movie
Derivation
• S is a special symbol called start symbol.
Multiple
Choice
Points
S => NP VP
=>
=>
=>
=>
=>
=>
=>
(rewrite S)
NNP VP
(rewrite NP)
Ram VP
(rewrite NNP)
Ram V NP
(rewrite VP)
Ram ate NP (rewrite V)
Ram ate ART N
(rewrite NP)
Ram ate the N
(rewrite ART)
Ram ate the rice
(rewrite N)
Two Strategies : Top-Down & Bottom-Up



Top down : Start with S and generate
the sentence.
Bottom up : Start with the words in
the sentence and use the rewrite
rules backwards to reduce the
sequence of symbols to produce S.
Previous slide showed top-down
strategy.
Bottom-Up Derivation
=>
=>
=>
=>
=>
=>
=>
=>
Ram ate the rice
NNP ate the rice
(rewrite
NNP V the rice
(rewrite
NNP V ART rice
(rewrite
NNP V ART N (rewrite rice)
NP V ART N (rewrite NNP)
NP V NP
(rewrite
NP VP
(rewrite V NP)
S
Ram)
ate)
the)
ART N)
Parsing Algorithm
A procedure that
“searches” through the grammatical rules
to find a combination that
generates a tree which
stands for the structure of the sentence
Parsing as Search

State Space : AND–OR Graph
S
NP
NNP
ART
VP
N
V
NP
NNP ART N

The leaves have links to words in the language,
e.g.,
ART

Top-Down Parsing

DFS on the AND-OR graph

Data structures:

Open List (OL): Nodes to be expanded
Closed List (CL): Expanded Nodes
Input List (IL): Words of sentence to be

Moving Head (MH): Walks over the IL


parsed
Trace
Top-Down
Parsing
Initialof
Condition
(T )
0
OL
S
CL (empty)
Ram ate the rice
MH
IL
Trace
T : of Top-Down Parsing
1
NP
S
Ram ate the rice
MH
OL
VP
CL
IL
Trace
T : of Top-Down Parsing
2
OL
NNP ART N VP
S NP
Ram ate the rice
MH
CL
IL
Trace
T : of Top-Down Parsing
3
OL
ART N VP
S NP NNP
Ram ate the rice
CL
IL
MH (portion of Input consumed)
Trace
T : of Top-Down Parsing
4
N
OL
VP
S NP NNP ART*
Ram ate the rice
MH
(* indicates ‘useless’ expansion)
CL
IL
Trace
T : of Top-Down Parsing
5
OL
VP
S NP NNP ART* N*
Ram ate the rice
MH
CL
IL
Trace
T : of Top-Down Parsing
6
OL
V NP
S NP NNP ART* N*
Ram ate the rice
MH
CL
IL
Trace
T : of Top-Down Parsing
7
OL
NP
S NP NNP ART* N* V
Ram ate the rice
MH
CL
IL
Trace
T : of Top-Down Parsing
8
OL
NNP ART N
S NP NNP ART* N* V NP
Ram ate the rice
MH
CL
IL
Trace
T : of Top-Down Parsing
9
ART
S NP NNP ART* N* V NNP*
Ram ate the rice
MH
OL
N
CL
IL
Trace
T : of Top-Down Parsing
10
OL
N
S NP NNP ART* N* V NNP ART
Ram ate the rice
MH
CL
IL
Trace
T : of Top-Down Parsing
11
OL
S NP NNP ART* N* V NNP ART N
Ram ate the rice
MH
CL
IL
Successful Termination: OL empty AND MH at the
end of IL.
Bottom-Up Parsing
Basic idea:



Refer to words from the
lexicon.
Obtain all POSs for each
word.
Keep combining until S is
obtained.
Example of Bottom-Up Parsing
Example from Jurafsky & Martin, 2000 :
 Sentence: Book the flight
 Grammar:
S
 NP VP | VP
NP  ART N | NNP
VP  V | V NP
ART  a | an | the
N
 book | flight | meal
V
 book | include
Trace of Bottom-Up
Parsing
book the flight
N
book
ART
the
NP
N
ART N
(discarded)
N
flight
V
book
V
ART
the
N
flight
VP
ART N
NP
VP ART N
(discarded)
NP
V ART N
VP
V
NP
S
VP
(success)
Implementation of Bottom-up
Parsing



Through a stack
Push words into the stack
Look for a “handle” to reduce to a nonterminal
(Aho, Sethi, Ullman 1986)

Termination by “start symbol on stack” and
“end of sentence”.
Trace of Bottom-Up Parsing
T0
book the flight
MH
Trace of Bottom-Up Parsing

Push ‘book’; advance input pointer
book the flight
MH
book
Trace of Bottom-Up Parsing

Reduce ‘book’
book the flight
MH
V
Trace of Bottom-Up Parsing

Push ‘the’; advance input pointer
book the flight
MH
the
V
Trace of Bottom-Up Parsing

Reduce ‘the’
book the flight
MH
ART
V
Trace of Bottom-Up Parsing

Push ‘flight’; advance pointer
book the flight
flight
ART
V
MH
Trace of Bottom-Up Parsing

Reduce ‘flight’
book the flight
N
ART
V
MH
Trace of Bottom-Up Parsing

Reduce ‘ART N’ by ‘NP’
book the flight
MH
NP
V
Trace of Bottom-Up Parsing

Reduce ‘V NP’ by ‘S’; termination by S
on stack and input exhausted.
book the flight
MH
S
Efficiency Issues


To reuse the work already done for constituent subtrees.
Example of inefficiency: Consider the sentence
the train from Mumbai to Kolkata via Nagpur

Grammar: S
 NP
NP  NP PP | ART N| NNP
ART  the
N
 train
NNP  Mumbai | Kolkata | Nagpur
PP  P NP
P
 from | to | via
Possible False Steps
the train from Mumbai to
Kolkata via Nagpur
push “the”; reduce to “ART”;
push “train”; reduce to “N”;
reduce “ART N” to “NP”;
reduce “NP” to “S”.
A

Possible
Perform
A, and False
then
Steps
from Mumbai to Kolkata via Nagpur
NP
push “from”; reduce to “P”;
push “Mumbai”; reduce to “NNP”;
reduce to “NP”; reduce “P NP” to “PP”;
Reduce “NP PP” to “NP”;
reduce “NP” to “S”.
B
Possible False Steps



Similarly for
“……….. to Kolkata via Nagpur”
and
“…….. via Nagpur”
Shift reduce conflicts occur for
S  NP
NP  NP PP
Should “NP” be reduced to “S” or should one search for “PP”
and then get bigger “NP”?
Reduplication
of
Work
the train
NP
S
from Mumbai
PP
S
NP
to Kolkata
PP
NP
NP
S
via Nagpur
PP
S
# of Repetitions for Subtree Computation







the train
from Mumbai
the train from Mumbai
to Kolkata
the train from Mumbai to Kolkata
via Nagpur
the train from Mumbai to Kolkata via
4 times
3 times
3 times
2 times
2 times
1 time
Nagpur 1 time
Can the “subtrees already computed” be reused?
Chart Parsing : Earley Algorithm
(Dynamic Programming based)


Sentence: book the flight
Grammar:
S
 NP VP | VP
NP  ART N | NNP
VP  V | V NP
ART  a | an | the
N
 book | flight
V
 book | include
Definitions


CHART is the data structure that stores
the record of matched constituents and
expected constituents through the use
of dotted rules.
A dotted rule is of the form
AB C
where B is the matched constituent
and C is the
Definitions



PREDICTOR is the procedure that
records by transitive closure the set of
dotted rules for a given state of the
input processing.
SCANNER is the procedure that
consumes the next input token.
COMPLETER is the procedure that

takes a dotted rule for which the dot is at
the rightmost end and
Illustration of the Algorithm
Example : “0book1 the2 flight3”
Chart[0]
State Dotted Rule
Position Comment
S0 S’  S
[0,0] Dummy
start state
S1 S
 NP VP
[0,0] Predictor
S2 NP  ART N
[0,0] Predictor
S3 NP  NNP
[0,0] Predictor
S5 S
 VP
[0,0] Predictor
S6 VP  V
[0,0] Predictor
S7 VP  V NP
[0,0] Predictor
S8
Chart[1]
Example “0book
1 the2 flight3”
V  book
[0,1]
Scanner
(consume in token “book”)
S9 VP  V
[0,1]
Completer
S10 S  VP
[0,1]
Completer
(complete waiting constituents)
S11 S’  S
[0,1]
Completer
(but not termination!)
S11 VP  V NP
S12 NP  ART N
chart)
[0,1]
[1,1]
Completer
Predictor
(new waiting constituents come into the
Example “0book1 the2 flight3”
Chart[2]
S14 ART  the
Scanner
S15 NP  ART N
Completer
[1,2]
[1,2]
Example “0book1 the2 flight3”
Chart[3]
S15 N  flight
S16 NP  ART N
S17 VP  Verb NP
S18 S  VP
S19 S’  S
termination
[2,3]
Scanner
[1,3]
Completer
[0,3]
Completer
[0,3]
Completer
[0,3] Successful