PPT - Search

Download Report

Transcript PPT - Search

Human Language Technology
Sentence Grammar
November 2009
HLT - Sentence Grammar
1
Introduction
This lecture has several themes:
1. Crash course in sentence-level grammar
•
•
Jurafsky and Martin 2nd ed. Chapter 12
Internet Grammar of English
http://www.ucl.ac.uk/internet-grammar/
2. Show how different linguistic phenomena can be
captured by grammar rules.
3. Dependency Parsing
4. Tagsets and Treebanks
November 2009
HLT - Sentence Grammar
2
Part 1
Grammar of English
November 2009
HLT - Sentence Grammar
3
Different Kinds of Rule
• Morphological rules.. govern how words
may be composed: re+invest+ing =
reinvesting.
• Syntactic rules .. govern how words
and constituents combine to form
grammatical sentences.
• Semantic rules .. govern how meanings may
be combined.
November 2009
HLT - Sentence Grammar
4
Syntax: Why?
• You need knowledge of syntax in many
applications:
–
–
–
–
–
–
Parsing
Grammar checkers
Question answering/database access
Information extraction
Generation
Translation
• Full versus superficial analysis?
November 2009
HLT - Sentence Grammar
5
Levels of Grammar Organisation
• Word Classes: different parts of speech (POS).
• Phrase Classes: sequences of words inheriting the
characteristics of certain word classes.
• Clause Classes: sequences of phrases containing
at least one verb phrase.
On the basis of these one may define:
• Grammatical Relations: role played by
constitutents e.g. subject; predicate; object
• Syntax-Semantics interface: mapping between
syntactic structures and meaning
November 2009
HLT - Sentence Grammar
6
Word Classes
• Closed classes.
–
–
–
–
determiners : the, a, an, four.
pronouns : it, he etc.
prepositions : by, on, with .
conjunctions : and, or, but.
• Open classes.
–
–
–
–
nouns refer to objects or concepts: cat , beauty , Coke.
adjectives describe or qualify nouns: fried chickens.
verbs describe what the noun does: John jumps.
adverbs describe how it is done: John runs quickly.
November 2009
HLT - Sentence Grammar
7
Word Class Characteristics
• Different word classes have characteristic
subclasses and properties
Subclasses
Properties
Noun
proper; mass;
count
number;
gender
Verb
transitive;
intransitive
Number; gender;
person, tense
Adjective
dimension; age;
colour
Number,
gender
November 2009
HLT - Sentence Grammar
8
Phrases
• Longer phrases may be used rather than a single
word, but fulfilling the same role in a sentence.
– Noun phrases refer to objects: four fried chickens.
– Verb phrases state what the noun phrase does: kicks the
dog.
– Adjective phrases describe/qualify an object:
sickly sweet.
– Adverbial phrases describe how actions are done:
very carefully.
– prepositional phrases: add information to a verb
phrase: on the table
November 2009
HLT - Sentence Grammar
9
Phrases can be Complex
e.g. Noun Phrases
• Proper Name or Pronoun: Monday; it
• Specifier, noun: the day
• Specifiers, premodifier, noun:
the first wet day
• Specifiers, premodifier, noun, postmodifier:
The first wet day that I enjoyed in June
November 2009
HLT - Sentence Grammar
10
But they all fit the same context
•
•
•
•
•
Monday
It
The day
The first wet day
The first wet day that I
enjoyed in June
November 2009
was sunny.
HLT - Sentence Grammar
11
Clauses
• A clause is a combination of noun phrases and
verb phrases
• Clauses can exist at the top level (main clause) or
can be embedded (subordinate clause)
– Top level clause is a sentence. E.g.
The cat ate the mouse.
– Embedded clause is subordinate e.g.
John said that Sandy is sick.
• Unlike phrases, whole sentences can be used to
say something complete, e.g. to state a fact or ask a
question.
November 2009
HLT - Sentence Grammar
12
Different Kinds of Sentences
•
•
•
•
•
Assertion: John ate the cat.
Yes/No question: Did John eat the cat?
Wh- question: What did John eat?
Command: Eat the cat John!
NB. All these forms share the same
underlying semantic proposition.
November 2009
HLT - Sentence Grammar
13
Part II
Context Free Grammar Rules
November 2009
HLT - Sentence Grammar
14
Formal Grammar
• A formal grammar consists of
–
–
–
–
Terminal Symbols (T)
Non Terminal Symbols (NT, disjoint from TS)
Start Symbol (a distinguished NT)
Rewrite rules of the form , where  and 
are strings of symbols
November 2009
HLT - Sentence Grammar
15
Classes of Grammar
Type Grammars
Languages
Machines
Recursively
Enumerable
TM
0
Phrase Structure
1
Context
Context Sensitive
Sensitive
LBA
2
Context Free
Context Free
PDA
3
Regular
Regular
FSA
November 2009
HLT - Sentence Grammar
16
Classes of Grammar
• Learnability
• Different classes of grammar result from
various restrictions on the form of rules
November 2009
HLT - Sentence Grammar
17
Restrictions on Rules
For all rules 
• Type 0 (unrestricted): no restrictions
• Type 1 (context sensitive): ||||
• Type 2 (context free):
 is a single NT symbol
• Type 3 (regular)
– Every rule is of the form A  aB or A  a
where A,B NT and aT
November 2009
HLT - Sentence Grammar
18
Which Class for NLP?
• Type 3 (Regular). Good for morphology. Cannot
handle central embedding of sentences.
The man that John saw eating died.
• Type 2 (Context Free). OK but problems handling
certain phenomena e.g. agreement.
• Type 1 (Context Sensitive). Computational
properties not well understood. Too powerful.
• Type 0 (Turing). Too powerful.
November 2009
HLT - Sentence Grammar
19
Weak versus Strong
• Grammar class that is too restrictive
– cannot characterise/discriminate exactly NL
sentence structures.
• Grammar class that is too general
– has the power to characterise/discriminate
structures that don't exist in human languages.
• More general, higher complexity
→ less efficient computations.
November 2009
HLT - Sentence Grammar
20
Example Grammar
Cabinet discusses police chief’s case
French gunman kills four
s
np
np
np
vp
November 2009





np vp
n
adj n
n np
v np
HLT - Sentence Grammar
21
Classifying the Symbols
• NT – symbols appearing on the left
• Start – symbol appearing only on the left from
which every other symbol can be derived.
• T – symbols appearing only on the right
• To include words we also need special rules
such as
n
n
n
 [police]
 [gunman]
 [four]
• Latter rules define the lexicon or “dictionary
interface”.
November 2009
HLT - Sentence Grammar
22
Grammar Induces
Phrase Structure
s
vp
np
adj
French
November 2009
np
n
gunman
HLT - Sentence Grammar
v
n
kills
four
23
Phrase Structure
• PS includes information about
– precedence between constituents
– dominance between constituents
• PS constitutes a trace of the rule
applications used to derive a sentence
• PS does not tell you the order in which the
rules were used
November 2009
HLT - Sentence Grammar
24
Procedural versus Declarative
• A grammar induces a structure but does not
tell you how to discover that structure
• A grammar is declarative
• A parser is a procedure that, given a suitable
representation of a grammar and a sentence,
actually discovers the structure(s).
• A parser is procedural
November 2009
HLT - Sentence Grammar
25
Handling Linguistic Phenomena
•
•
•
•
•
Different sentence-types
Nested structures
Agreement
Multiwords
Subcategories of verb
November 2009
HLT - Sentence Grammar
26
Different Sentence Types....
....Different Grammar Rules
• Declaratives
John left.
S → NP VP
• Imperatives
Leave!
S →VP
• Yes-No Questions
Did John leave?
S →Aux NP VP
• WH Questions
When did John leave?
S →Wh-word Aux NP VP
November 2009
HLT - Sentence Grammar
27
Recursively Nested
Structures handled by ....
•
•
•
•
Flights to Miami
Flights to Miami from Boston
Flights to Miami from Boston in April
Flights to Miami from Boston in April on
Friday
• Flights to Miami from Boston in April on
Friday under $300.
• Flights to Miami from Boston in April on
Friday under $300 with lunch.
November 2009
HLT - Sentence Grammar
28
Recursive Rules
NP → N
NP → NP PP
PP → Preposition NP
Flights from miami to boston
November 2009
HLT - Sentence Grammar
29
Ambiguity
np 
pp 
np pp
prep np
(the man) (on the hill with a telescope by the sea)
(the man on the hill) (with a telescope by the sea)
(the man on the hill with a telescope)( by the sea)
etc.
November 2009
HLT - Sentence Grammar
30
Handling Agreement
• NP → Determiner N
• Include these days, this day
• Exclude this days, these day
NP → NPSing
NP → NPPlur
NPPlur → DetSing NSing
NPPlur → DetPlur NPlur
• Agreement also includes number, gender, case.
• Danger: proliferation of categories/rules.
November 2009
HLT - Sentence Grammar
31
Handling Multiwords
•
•
•
•
•
John ran up the stairs
John rang up the doctor
John ran the stairs up*
John rang the doctor up
John rang the doctor who lives in Paris up
November 2009
HLT - Sentence Grammar
32
Ordinary CF rules don’t work
John rang up the doctor
• VP → V NP
• here V is multiword
John rang the doctor up
• VP → V NP particle_from _V
• here, multiword has split into two parts
• challenge is to express the relation between the
parts
November 2009
HLT - Sentence Grammar
33
Subcategorisation
• Intransitive verb: no object
John disappeared
John disappeared the cat*
• Transitive verb: one object
John opened the window
John opened*
• Ditransitive verb: two objects
John gave Mary the book
John gave Mary*
November 2009
HLT - Sentence Grammar
34
Subcategorisation Rules
• Intransitive verb: no object
VP → V
• Transitive verb: one object
VP → V NP
• Ditransitive verb: two objects
VP → V NP NP
• If you take account of the category of items
following the verb, there are about 40
different patterns like this in English.
November 2009
HLT - Sentence Grammar
35
Overgeneration
• A grammar should generate only sentences in the
language.
• It should exclude sentences not in the language.
s  n vp
vp  v
n  [John]
v  [snore]
v  [snores]
November 2009
HLT - Sentence Grammar
36
Undergeneration
• A grammar should generate all sentences in the
language.
• There should not be sentences in the language that
are not generated by the grammar.
s  n vp
vp  v
n  [John]
n  [gold]
v  [found]
November 2009
HLT - Sentence Grammar
37
Appropriate Stuctures
– A grammar should assign linguistically
plausible structures.
s
s
vs.
np
vp
np
n v d a
n
John ate a juicy hamburger
November 2009
vp
n v d a
n
John ate a juicy hamburger
HLT - Sentence Grammar
38
Criteria for Evaluating Grammars
• Does it undergenerate?
• Does it overgenerate?
• Does it assign appropriate structures to sentences
it generates?
• Is it simple to understand? How many rules are
there?
• Does it contain generalisations or is it just a
collection of special cases?
• How ambiguous is it?
November 2009
HLT - Sentence Grammar
39
Tagsets
• The main parts of speech reflect naturally
occurring occurrence data.
• Practical applications often make use of
special tags which include additional
information such as number and case.
• One of the most commonly used tagsets is
the 45-tag Penn Treebank tagset, used for
the Brown corpus.
November 2009
HLT - Sentence Grammar
40
Penn Treebank Tagset
November 2009
HLT - Sentence Grammar
41
POS Tagging
The grand jury commented on a number of other topics
POS Tagger
The/DT grand/JJ jury/NN commmented/VBD
on/IN a/DT number/NN of/IN other/JJ topics/NNS ./.
November 2009
HLT - Sentence Grammar
42
Treebanks
• Treebanks are corpora in which each sentence has been
paired with a parse tree (presumably the right one).
• These are generally created
– By first parsing the collection with an automatic parser
– And then having human annotators correct each parse
as necessary.
• This requires detailed annotation guidelines that provide a
– POS tagset,
– a grammar and
– instructions for how to deal with particular grammatical
constructions.
November 2009
HLT - Sentence Grammar
43
Penn Treebank
• Penn Treebank is a widely used treebank
maintained by the Linguistic Data Consortium.
• The Penn Treebank Project annotates naturallyoccurring text for linguistic structure.
• Contains skeletal parses showing rough syntactic
and semantic information -- a bank of linguistic
trees.
• Most well known is the Wall Street Journal section
containing 1M words from the 1987-1989 Wall
Street Journal.
November 2009
HLT - Sentence Grammar
44
Penn Treebank Example
November 2009
HLT - Sentence Grammar
45
Treebank Grammars
• Treebanks implicitly define a grammar for
the language covered in the treebank.
• Simply take the local rules that make up the
sub-trees in all the trees in the collection
and you have a grammar.
• Not complete, but if you have decent size
corpus, you’ll have a grammar with decent
coverage.
November 2009
HLT - Sentence Grammar
46
Treebank Grammars
• Such grammars tend to be very flat due to
the fact that they tend to avoid recursion.
• For example, the Penn Treebank has 4500
different rules for VPs. Among them...
November 2009
HLT - Sentence Grammar
47
Heads in Trees
• Finding heads in treebank trees is a task that
arises frequently in many applications.
– Particularly important in statistical
parsing
• We can visualize this task by annotating the
nodes of a parse tree with the heads of each
corresponding node.
November 2009
HLT - Sentence Grammar
48
Lexically Decorated Tree
November 2009
HLT - Sentence Grammar
49
Head Finding
• The standard way to do head finding is to
use a simple set of tree traversal rules
specific to each non-terminal in the
grammar.
November 2009
HLT - Sentence Grammar
50
Noun Phrases
November 2009
HLT - Sentence Grammar
51
Treebank Uses
• Treebanks (and headfinding) are
particularly critical to the development of
statistical parsers.
• Also valuable for Corpus Linguistics
– Investigating the empirical details of various
constructions in a given language
November 2009
HLT - Sentence Grammar
52
Dependency Grammars
• In CFG-style phrase-structure grammars the main
focus is on constituents.
• But it turns out you can get a lot done with just
binary relations among the words in an utterance.
• In a dependency grammar framework, a parse is a
tree where
– the nodes stand for the words in an utterance
– The links between the words represent dependency
relations between pairs of words.
– Relations may be typed (labeled), or not.
November 2009
HLT - Sentence Grammar
53
Dependency Relations
November 2009
HLT - Sentence Grammar
54
Dependency Parse
They hid the letter on the shelf
November 2009
HLT - Sentence Grammar
55
Dependency Parsing
• The dependency approach has a number of
advantages over full phrase-structure parsing.
– Deals well with free word order languages
where the constituent structure is quite fluid
– Parsing is much faster than CFG-based parsers
– Dependency structure often captures the
syntactic relations needed by later applications
• CFG-based approaches often extract this same
information from trees anyway.
November 2009
HLT - Sentence Grammar
56
Dependency Parsing
• There are two modern approaches to
dependency parsing
– Optimization-based approaches that search a
space of trees for the tree that best matches
some criteria
– Shift-reduce approaches that greedily take
actions based on the current word and state.
November 2009
HLT - Sentence Grammar
57
Summary
• Context-free grammars can be used to model various facts
about the syntax of a language.
• When paired with parsers, such grammars consititute a
critical component in many applications.
• Constituency is a key phenomenon easily captured with
CFG rules.
– But certain linguistic phenomena pose significant
problems
• Dependency parsing is suitable for languages with free
word order
• Treebanks pair sentences in corpus with their
corresponding trees.
November 2009
HLT - Sentence Grammar
58