Lecture: Syntax and Parsing 1

Download Report

Transcript Lecture: Syntax and Parsing 1

Syntax and Processing it
(slide set 1)
John Barnden
School of Computer Science
University of Birmingham
Natural Language Processing 1
2015/16 Semester 2
Our Concerns
• The general nature of syntax (grammatical structure of sentences)
• Its significance
• Context-free grammars
S  NP VP
• Syntactic structure of a sentence, e.g.:
NP
etc. etc.
S
VP
• Parsing: using a grammar to compute a syntax (syntactic structure)
for a given sentence.
Overview
• The motivation for the idea of syntax
• Review of some types of phrase (syntactic categories)
• (Review of) notion of context-free grammar
• How separate/separable is syntax from other aspects of language?
• Methods for syntactic analysis:
– Active chart parsing (REQUIRED, covered in lectures)
– Dependency Parsing (REQUIRED, covered in lectures)
– Definite Clause Grammars [as an extension to Prolog] [OPTIONAL, not in lecs]
– Recursive transition networks [OPTIONAL, not in lectures]
Motivation
• Sentences seem to be able to be correctly or incorrectly structured in a way that
seems largely independent of the meaning or the plausibility of the meaning.
• The following seem correctly structured, even when odd semantically:
– The monkey ate the banana.
– The banana ate the monkey.
– The car hit the garage.
– The garage hit the car.
And moreover they seem to have the same structure.
• The following seem incorrectly structured:
– Banana the ate monkey the.
– The this car in was garage.
And we notice that incorrectness irrespective of considering meaning, even if we
do guess a possibly intended meaning.
Motivation, contd
• At the same time, what particular structure we see in a sentence affects meaning in
important ways.
• Classic example: The girl saw the woman with the telescope in the park.
• Some possible roughly-described structures and interpretations:
– [The girl] saw [the woman] [with the telescope] [in the park]
The seeing used the telescope and happened in the park
– [The girl] saw [the woman] [with [the telescope in the park]]
The seeing used the telescope that was in the park
– The girl [saw [the woman with the telescope] [in the park]]
The seeing happened in the park; the woman had a telescope
– The girl [saw [the woman [with [the telescope in the park]]]]
The woman was with the telescope that was in the park.
– Any others?
Motivation, contd
• We also see that phrases of certain sorts can be moved around in certain ways.
E.g., we can do this to
The girl saw the woman with the telescope in the park.
to get
The woman in the park with the telescope saw the girl.
• But we cannot move a phrase just anywhere and we cannot move just any phrase
around: e.g. cannot change the original to get
The with the telescope girl saw the woman in the park.
The girl saw the telescope the woman with in the park.
Motivation, contd
• Recursive structuring: phrases can be seen to have other phrases as constituents:
• We can see
The woman with the telescope in the park eating sandwiches
as
[The woman [with [the telescope [in the park]]] [eating sandwiches]]
• There is no in-principle limit to the depth of structuring or to its breadth.
• Hence there is no in principle limit to sentence length.
Motivation, contd
• Some sorts of constituent of a sentence generally require the presence (in suitable
positions) of other constituents of particular types. E.g.
A preposition needs to be followed by a specific sort of phrase
A verb may need to have surrounding constituents playing the role of subject,
direct object, etc.
• Any phrase of an appropriate general type, no matter how internally complex the
phrase is, can, typically, play a given role.
Motivation, contd: a Caveat
English grammar is very much based on contiguity (adjacency) of lexical forms and
their order in the sentence.
This is partly because it has very little in the way of agreement between words (e.g.
a plural subject requiring a plural form of the verb) and very little in the way of case
endings (e.g. nominative versus accusative forms of a pronoun) or verb inflections.
But languages with more pervasive use of agreement, case endings, verb
inflections, etc. can have widely separated words that form units. E.g., an
adjective and a noun can be associated with each other by agreement rather than
contiguity.
English arguably has examples of this: notably verb-particle phrases such as “take
up” meaning “adopt”. The verb and the particle are often separated, as in “He took
that marvellous plan of Julie’s up”. But they arguably form a syntactic unit.
We can even imagine a language where order and adjacency are completely
irrelevant.
a Caveat, contd.
So, in general, talk of “phrases” and “constituents” needs to be taken as
encompassing non-contiguous sets of lexical forms, even though in the following
they will always be contiguous sequences.
Motivation: Summary
•
Language has constituent structure: the expressions of the language can be structured into
constituents of particular types, where a constituent of a particular type can itself often
have, or may need to have, sub-constituents of particular types.
•
Language has syntactic compositionality: there are composition rules that, whenever they
are given constituents of particular types, produce constituents (or whole sentences) of
particular types, where the resulting types are dependent only on the rule and the
constituents inputed to the rule.
The compositionality is linked to the constituent structure: to at least a first approximation,
the resulting constituent contains the original constituents as parts.
E.g. Putting a determiner and a noun together to get a (simple sort of) “noun phrase”; putting a
preposition together with a noun phrase to get a “prepositional phrase”; putting a noun phrase and
a prepositional phrase together to get a noun phrase.
•
So, the constituent structure is systematic by virtue of the compositionality.
•
Through the syntactic compositionality, language is (in principle) productive: the
composition rules can be applied successively to any extent, leading to infinitely many
possible constituents and sentences.
•
And constituents/sentences can be of any structural depth and length, in principle.
Performance versus Competence
•
The ability freely to apply a grammar when speaking or understanding is often taken to be a
sort of competence.
•
But often practical considerations place limits on what one can do: e.g. what amount of
complexity one can tolerate in a sentence.
– Notice the in principle bits on previous slide.
•
And one can make mistakes.
•
These perturbations of competence are labelled as being matters of performance.
•
But some performance matters can be built into grammars, so it’s ultimately rather arbitrary
which effects are relegated to “performance.”
– E.g., a grammar could allow for a common mistake, or could restrict the depth of embedding.
•
Performance also encompasses graded effects like relative perceived naturalness of
structures or structure ordering in specific cases. E.g.: although “He gave the boy a pencil”
and “He gave a pencil to the boy” are both equally OK, consider:
– He gave the crying boy who had spots all over his face and a hole in his shirt a pencil.
– He gave a pencil to the crying boy who had spots all over his face and a hole in his shirt.
Syntactic Categories
•
Syntactic Categories: the categories of constituent: e.g. Sentence (S), Noun Phrase (NP),
Verb Phrase (VP), Prepositional Phrase (PP), etc.,
and I include all the lexical categories (Noun, Adjective, etc.).
•
Sentence (S): Phrases in syntactic category S are the largest lexical-form sequences that are
subject to syntactic characterization, usually.
– But people have proposed grammar-like structuring of discourse more broadly.
A special case of this are patterns of “turn taking” in conversation.
Of course, can trivially go beyond sentences, as e.g. defining a paragraph to be a sequence of
sentences.
– A sentence can in principle contain a smaller sentence as a constituent, although a grammar often
classifies the smaller sentence as a “clause” instead.
A clause is a sentence-like constituent; often can’t stand as a sentence in its own right – e.g.
Relative clause “that is in the park” following “the telescope”.
•
S needs to cover declarative (= statement-making), interrogative, and imperative sentences.
Syntactic Categories: S contd
•
To cope with language in practice, we have to allow S to be broader than normally thought
of, or allow a discourse to include complete utterances that are non-sentences. E.g.,
consider the following complete utterances:
– Yes. / Hallo. / Hmm? / Oh, damn. / How about it.
– Brilliant. / Or tomorrow, for that matter. / A beer, please. / You silly idiot.
[Could be considered to be cases of ellipsis.]
– Abbreviated language of various other sorts, such as:
 the expression in square brackets just above
 newspaper headlines
 in road signs
Syntactic Categories: General Problems
•
Ellipsis more generally needs to be taken care of: either within the definition of S and other
syntactic categories, or as a special side-issue that can distort constituents. E.g.
– John wants to have snail and seaweed ice-cream on his cake, but Mary just seaweed.
• Some idioms or other fixed phrases, and some special constructions, have strange
syntax:
– By and large, ... / By the by, ...
– The more I walk the more relaxed I get. [Can also be just part of a larger unit.]
• Punctuation: typically not handled much or at all in a grammar, partly because too
inconsistent and too style-dependent.
• Direct speech/thought is often not quoted, leading to special structuring, e.g.:
– No! I’m not up for that, he said/thought.
Syntactic Categories: Noun Phrase (NP)
• An NP is a constituent that can act like a noun.
• Examples [with illustrative context in square brackets]:
– any noun, any proper name
– any non-adjectival pronoun (I, him, these [when pronoun], which [when pronoun], ...)
– the horse / that horse / my horse / John’s horse
– three huge red cars
– one hundred and three huge, very red cars
– the horse with a long tail and a glossy mane
(includes a PP)
– the horse that had a long tail and a glossy mane
(includes a relative clause)
– the man called Horse [lived with the Algonquins]
– the baking
(gerund)
– baking a cake [is a tricky business]
(gerund)
– which horse [do you want to ride?]
– for John to shout at a policeman [is unusual]
(for-complement used as noun)
– to shout at a policeman [is to run a huge risk]
(to-complement used as noun)
– that John shouted at a policeman [amazes me]
(that-complement used as noun)
– the Riviera firefighter arson scandal
(inlcudes noun-noun compound)
Syntactic Categories: Prepositional Phrase (PP)
•
a PP is a preposition followed by an NP
– on the table
– with metal legs
•
The preposition may be more than one word (e.g. In: it can cost up to five pounds).
•
The NP can itself include a PP (e.g.: on the table with the metal legs ).
•
In that example, “with the metal legs” serves an adjectival function:
– it’s a form of adjectival phrase
•
He broke the table on Saturday: the PP serves an adverbial function:
– it’s a form of adverbial phrase
• Should “in back of”, “by means of”, “with respect to” etc. be counted as multiword prepositions? Or should “back”, “means” and “respect” be treated as
separate nouns?
Syntactic Categories: Verb Phrase (VP)
•
a VP is a verb possibly preceded by auxiliary verbs and possibly followed by (non-subject)
complements and adjuncts
– washed
– washed the dishes
– wash the dishes
(could be a command)
– do wash the dishes
– don’t / do not wash the dishes
– [we] wash the dishes
– was putting the cloth on the table at 5pm
“the cloth” and “on the table” are complements of the verb: necessary parameter values
“at 5pm” is an adjunct: an additional, optional qualification
– believed that John was laying the table
“that John was laying the table” is a that-complement, containing a clause/sentence
wanted to lay the table / Mary wanted John to lay the table:
here we have a to-complement, containing a VP or a clause (a sentence-like phrase)
Syntactic Categories: Relative Clauses
•
A relative clause is a clause with an adjectival function, hence a form of adjectival clause.
•
A restrictive relative clause: a clause that has an adjectival function within a noun phrase, further
restricting what is being talked about:
The goat that/which had long ears [ate my shoelaces]
The goat what/wot ’ad long ears [ate my shoelaces]
(incorrect but very possible!)
The man that/who had long ears [ate my lunch].
The man with whom I had lunch [had long ears].
The man I had lunch with [had long ears]
The man that/who I had lunch with [had long ears]
•
A non-restrictive relative clause: a clause that is in apposition with (= accompanies) a noun phrase,
and adds extra information to something already specified by the noun-phrase:
The goat, which had long ears, [ate my shoelaces]
The man, with whom I had lunch by the way, [had long ears]
The man, who I did have lunch with by the way, [had long ears]
•
Restrictive case: no commas.
Non-restrictive case: commas.
Syntactic Categories: Relative Clauses, etc.
•
The goat really did eat my shoelaces. [Needed for exam  ]
•
[Mary knows] who laid the table: a who-complement, with the same internal syntax as a
relative clause.
•
Apposition is a broader phenomenon than non-restrictive relative clauses:
Theodosia Kirkbride, a most refined landlady, never mentioned the rent to me.
Theodosia Kirkbride – landlady, mud wrestler and quantum physicist – ate my lunch.
Theodosia Kirkbride, the landlady, never mentioned the rent to me.
The landlady, Theodosia Kirkbride, never mentioned the rent to me.
The vile boy, singing an annoying little X Factor tune, kicked my dustbin over.
Here the underlined phrase is most simply analysed as a Verb Phrase, contrasting with the Noun
Phrases above.
What Now?
•
Particular syntactic structures as syntax trees.
NB: Somewhat restrictive notationally. Not the only method used, but is a standard tool.
•
Context-free grammar for capturing rules of syntactic structure (that is generally expressed
as trees). [brief – see also Language and Logic module]
•
Active Chart Parsing – a major, instructive traditional parsing mechanism, based on CF
grammar
•
Dependency Parsing: a major current, statistical learning based technique, with a
different view of syntactic structure.
•
Optional reading (slides to be linked from module webpages):
• Definite Clause Grammars: based on traditional CF grammar in Prolog.
• Recursive Transition Networks: based on traditional CF grammar, in network form
a Syntax Tree
XY : non-terminal syntactic categories
Xyz : terminal syntactic categories
abc : lexical forms
S
NP
Det
VP
Noun
Verb
PP
Prep
dog
sat
on
Prep
NP
Det
The
PP
a
Adj Noun
big
mat
NP
Noun
with
dignity
S
Ambiguous Syntax:
Another Parse
NP
Det
VP
Noun
Verb
PP
Prep
NP
Det Adj
Noun
PP
Prep
NP
Noun
The
dog
sat
on
a
big
mat
with
dignity
Diagnostic Syntax Exercise
(optional homework, not assessed)
For each of the following, annotate each word with a POS (can just be simple ones like
Noun, Verb, etc.) and provide a syntax tree. If you can’t do the latter, at least annotate
some phrases as to what sort of thing they are: “noun phrase”, “verb phrase”,
“prepositional phrase”, “relative clause” or whatever.
1) Dogs hate cats.
2) The dog bit three cats.
3) The dog bit the cat with the long tail.
4) The girl saw the woman with the telescope in the park.
5) The man who came to dinner was my uncle.
6) Shouting at a policeman is not a sensible thing to do.
7) For John to shout at a policeman is unusual.
8) The more I walk the thinner I get.
9) By and large, the British are confused about their identity.
10) Mary believes that Cadillacs are a sort of fruit.
11) Mary wants to eat a Cadillac, boiled then fried.
Context-Free Grammars (CFGs)
•
A CF grammar consists of productions (a form of rule). A production is of form:
<non-terminal symbol>  <seq of one or more symbols (terminal or non-terminal)>
Non-CF grammars: more symbols allowed on LHS.
•
Non-terminal symbols: non-terminal syntactic categories (NP, etc.) and special categories
(e.g., AdjSeq, PPseq – though these could be regarded as new syntactic categories).
Terminal symbols: lexical categories (Noun, etc.) and sometimes specific lexical forms that
play a special syntactic function (e.g.: to {preceding infinitive verb}, by {in passive voice}).
•
Examples of productions:
S  NP VP
S  NP VP PP
NP  Noun
NP  Det ADJSEQ Noun
ADJSEQ  Adj AdjSeq
S  VP
ADJSEQ  Adj
Context-Free Grammars (CFGs), contd.
•
Some other refinements:
• Kleene star for zero or more occurrences:
S  NP VP PP*
• Related + symbol for one or more occurrences: NN-COMPOUND  Noun+
• Square brackets for optional elements: NP  [Det] Adj* Noun
• Upright bar for alternatives: MOD  PP | RELCLAUSE
•
Such refinements reduce the number of productions needed, but are not formally
necessary (they don’t increase the coverage of the grammar).
But can affect precise syntactic structure: PP* causes possibly more branches at parent node,
whereas PPSEQ leads to binary branching.
•
NB: Traditionally, “N”, “V” are used instead of our “Noun”, “Verb”.
Tree view of productions
VP  Verb NP
VP
Verb
NP
Grammatical Categories
•
•
A grammatical category is a dimension along which (some) lexical or syntactic consistuents
can vary in limited, systematic ways, such as (in English):
Number
singular or plural: lexically, nouns, verbs, determiners, numerals
Person
first, second and third: lexically, only for verbs, nouns and some pronouns
Tense
present, past (various forms), future, etc.: lexically, only for verbs
Gender
M, F, N [neither/neuter]: lexically, only some pronouns and some nouns
Syntactic constituents can sometimes inherit grammatical category values from their
components, e.g. (without showing all possible GC values):
the big dog: 3rd person M/F/N singular NP // the big dogs: 3rd person M/F/N plural NP
we in the carpet trade: 1st person M/F plural NP // you silly idiot: 2nd person M/F singular NP
eloped with the gym teacher: past-tense VP // will go: future-tense VP
the woman with the long hair: female NP // the radio with the red knobs: neuter NP
•
A lexical or syntactic constituent can be ambiguous as to a GC value:
e.g. sheep: singular/plural;
manage: singular/plural 1st/2nd person
((Aside))
•
Syntactic constituents can sometimes have grammatical category values arising from the
combination of components, not inherited from (though affected by those of) individual
components:
• Will have gone: future-tense & perfect-aspect VP
{will have: future; gone: past participle}
Grammatical Categories, contd
•
Why worry about grammatical categories? Because there needs to be agreement of certain
sorts. We can’t have the following (the * indicates incorrectness):
•
Within NPs as to number:
* a flowers // * this flowers // * many flower // * three big flower
•
Between VPs and their NP subjects as to number & person, and case when NP is pronoun:
* he eagerly eat [when not subjunctive]
* the women in the theatre eats
* all the women was
* I is
•
// us are // * them went
Between VPs and their pronoun direct objects as to case:
* the guy robbed we
•
Between NPs and pronouns referring to them, as to number and gender:
* the woman took out their gun and shot his dog [when it’s her gun and her dog]
Centre Embedding
• CFGs can deal with unrestricted centre embedding:
The book that the professor recommended is good.
1
1
The book that the professor that the students like recommended is good.
1
2
2
1
The book that the professor that the students that dogs hate like recommended is good.
1
2
3
3
2
1
• English grammar supposedly allows this to unrestricted depth.
• But this is disputed .......
First, some variant examples
• Without the “that”s:
The book is good.
The book the professor recommended is good.
1
1
The book the professor the students like recommended is good.
1
2
2
1
The book the professor the students dogs hate like recommended is good.
1
2
3
3
2
1
• Formally similar but much more difficult yet (even if you put back the thats):
Dogs fight.
Dogs dogs fight fight.
Dogs dogs dogs fight fight fight.
Dogs dogs dogs dogs fight fight fight fight.
The Dispute about such examples
• We feel increasing difficulty with increasing depth of centre embedding,
especially but not exclusively when semantic cues don’t serve to keep the levels
distinct.
• It’s traditional to say that this difficulty is “just” a matter of performance: i.e.
just a matter of practical restrictions on what we can handle by way of grammar,
and doesn’t affect our competence: i.e. the knowledge of grammar that we
actually have.
And certainly if we think hard and long enough we can make sense of any
amount of embedding, it might be argued.
• But another view is that such competence is illusory in the sense that it’s not
part of our natural, unreflective use of language: we can only go beyond the
limitations by special, conscious effort. So the grammar involved in natural
communication does not have things like unrestricted centre embedding.
Many connectionists (neural net theorists), and others, have views like this.
Finite State Automata/Networks
• FSAs cannot achieve unrestricted centre embedding. They are therefore
traditionally said to be inadequate for capturing grammars of natural languages.
• A variant form of network called Recursive Transition Networks (see separate
optional slides) cures the problem.
[Finite State Automata/Networks]
(Optional)
• Notice the basic structure that the centre-embedded examples I gave have is
An B Cn where A, B, C are syntactic categories of some sort, and both
occurrences of n are indeed the same value. (See the numerals annotating
the examples: they show the n layers.)
It can be proved that FSA cannot recognize such structures for arbitrary n
while excluding structures of form An B Cm where m is different from n.
• Intuitively, the basic problem with FSAs is that they can’t remember a count that
could get indefinitely large, in this case n.
Since the automaton can only be in finitely many different states, it’s impossible
for it to reliably know, once something of form An B has been consumed, what n
was in all cases. So there’s no way it could possibly consume just the right
number of Cs in all cases.
[FSAs, contd.] (Optional)
• Exercise: satisfy yourself that if there is some limit on n, then an FSA can work.
Simple exercise: be sure you know how an FSA could recognize things of form An
B Cm where m may be different from or equal to n.
In both those tasks, you can take A, B and C to be single lexical forms a, b and c,
for simplicity.
• The point of those exercises is that they show that the problem is: not being
able to remember an indefinitely large count.