EECS 595 / LING 541 / SI 661 Natural Language Processing
EECS 595 / LING 541 / SI 661&761
Natural Language Processing
Fall 2005
Lecture Notes #2
Course logistics
• Instructor: Prof. Dragomir Radev ([email protected])
Ph.D., Computer Science, Columbia University
Formerly at IBM TJ Watson Research Center
• Times: Thursdays 2:40-5:25 PM, in 411, West Hall
• Office hours: TBA, 3080 West Hall Connector
Course home page:
http://www.si.umich.edu/~radev/NLP-fall2005
Linguistic Fundamentals
Syntactic categories
• Substitution test:
Nathalie likes {black, Persian, tabby, small} cats.
• Open (lexical) and closed (functional) categories:
Open: No-fly-zone, yadda yadda yadda
Closed: the, in
Morphology
The dog chased the yellow bird.
• Parts of speech: eight (or so) general types
• Inflection (number, person, tense…)
• Derivation (adjective-adverb, noun-verb)
• Compounding (separate words or single word)
• Part-of-speech tagging
• Morphological analysis (prefix, root, suffix, ending)
Part of speech tags
From Church (1991) - 79 tags
NN    /* singular noun */
IN    /* preposition */
AT    /* article */
NP    /* proper noun */
JJ    /* adjective */
,     /* comma */
NNS   /* plural noun */
CC    /* conjunction */
RB    /* adverb */
VB    /* un-inflected verb */
VBN   /* verb +en (taken, looked (passive, perfect)) */
VBD   /* verb +ed (took, looked (past tense)) */
CS    /* subordinating conjunction */
Jabberwocky (Lewis Carroll)
'Twas brillig, and the slithy toves
Did gyre and gimble in the wabe:
All mimsy were the borogoves,
And the mome raths outgrabe.
"Beware the Jabberwock, my son!
The jaws that bite, the claws that catch!
Beware the Jubjub bird, and shun
The frumious Bandersnatch!"
Nouns
• Nouns: dog, tree, computer, idea
• Nouns vary in number (singular, plural),
gender (masculine, feminine, neuter), case
(nominative, genitive, accusative, dative)
• Latin: filius (m), filia (f), filium (object)
German: Mädchen
• Clitics ('s)
Pronouns
• Pronouns: she, ourselves, mine
• Pronouns vary in person, gender, number, case (in
English: nominative, accusative, possessive, 2nd
possessive, reflexive)
Mary saw her in the mirror.
Mary saw herself in the mirror.
• Anaphors: herself, each other
Determiners and adjectives
• Articles: the, a
• Demonstratives: this, that
• Adjectives: describe properties
• Attributive and predicative adjectives
• Agreement: in gender, number
• Comparative and superlative (derivative and periphrastic)
• Positive form
Verbs
• Actions, activities, and states (throw, walk, have)
• English: four verb forms
• tenses: present, past, future
• other inflection: number, person
• gerunds and infinitive
• aspect: progressive, perfective
• voice: active, passive
• participles, auxiliaries
• irregular verbs
• French and Finnish: many more inflections than English
Other parts of speech
• Adverbs, prepositions, particles
• phrasal verbs (the plane took off, take it off)
• particles vs. prepositions (she ran up a
bill/hill)
• Coordinating conjunctions: and, or, but
• Subordinating conjunctions: if, because,
that, although
• Interjections: Ouch!
Phrase structure
• Constraints on word order
• Constituents: NP, PP, VP, AP
• Phrase structure grammars
S
  NP
    PN   Spot
  VP
    V    chased
    NP
      Det  a
      N    bird
Phrase structure
• Paradigmatic relationships (e.g., constituency)
• Syntagmatic relationships (e.g., collocations)
S
  NP   That man
  VP
    VBD  caught
    NP   the butterfly
    PP
      IN   with
      NP   a net
Phrase-structure grammars
Peter gave Mary a book.
Mary gave Peter a book.
• Constituent order (SVO, SOV)
• imperative forms
• sentences with auxiliary verbs
• interrogative sentences
• declarative sentences
• start symbol and rewrite rules
• context-free view of language
Sample phrase-structure grammar
S   → NP VP
NP  → AT NNS
NP  → AT NN
NP  → NP PP
VP  → VP PP
VP  → VBD
VP  → VBD NP
PP  → IN NP
AT  → the
NNS → children
NNS → students
NNS → mountains
VBD → slept
VBD → ate
VBD → saw
IN  → in
IN  → of
NN  → cake
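A grammar like this can be run "forwards" to generate sentences. The following is a minimal sketch, not part of the original slides: it assumes the left-hand side of the IN NP rule is PP (which is what the NP → NP PP and VP → VP PP rules require), and the function name generate, the max_depth cap, and the uniform random choice among rules are illustrative choices.

```python
import random

# Rewrite rules of the sample grammar; keys are nonterminals,
# values are the alternative right-hand sides.
GRAMMAR = {
    "S":   [["NP", "VP"]],
    "NP":  [["AT", "NNS"], ["AT", "NN"], ["NP", "PP"]],
    "VP":  [["VP", "PP"], ["VBD"], ["VBD", "NP"]],
    "PP":  [["IN", "NP"]],   # assumed reading of the rule with IN NP on the right
    "AT":  [["the"]],
    "NNS": [["children"], ["students"], ["mountains"]],
    "VBD": [["slept"], ["ate"], ["saw"]],
    "IN":  [["in"], ["of"]],
    "NN":  [["cake"]],
}

def generate(symbol="S", rng=random, depth=0, max_depth=6):
    """Expand `symbol` top-down and return the generated words as a list."""
    if symbol not in GRAMMAR:            # a terminal word
        return [symbol]
    rules = GRAMMAR[symbol]
    if depth >= max_depth:
        # Past the (hypothetical) depth cap, skip rules that re-introduce
        # the same symbol (NP -> NP PP, VP -> VP PP) so expansion must stop.
        rules = [r for r in rules if symbol not in r] or rules
    words = []
    for child in rng.choice(rules):
        words.extend(generate(child, rng, depth + 1, max_depth))
    return words

rng = random.Random(0)
for _ in range(3):
    print(" ".join(generate("S", rng)))
```

Every generated string is grammatical with respect to these rules, but many are semantically odd, which is the distinction between ungrammaticality and semantic abnormality drawn later in these notes.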
Phrase structure grammars
• Local dependencies
• Non-local dependencies
• Subject-verb agreement
The women who found the wallet were given a reward.
• wh-extraction
Should Peter buy a book?
Which book should Peter buy?
• Empty nodes
Dependency: arguments and adjuncts
Sue watched the man at the next table.
• Event + dependents (verb arguments are usually
NPs)
• agent, patient, instrument, goal - semantic roles
• subject, direct object, indirect object
• transitive, intransitive, and ditransitive verbs
• active and passive voice
Subcategorization
• Arguments: subject + complements
• adjuncts vs. complements
• adjuncts are optional and describe time,
place, manner…
• subordinate clauses
• subcategorization frames
Subcategorization
Subject: The children eat candy.
Object: The children eat candy.
Prepositional phrase: She put the book on the table.
Predicative adjective: We made the man angry.
Bare infinitive: She helped me walk.
To-infinitive: She likes to walk.
Participial phrase: She stopped singing that tune at the
end.
That-clause: She thinks that it will rain tomorrow.
Question-form clauses: She asked me what book I was
reading.
Subcategorization frames
• Intransitive verbs: The woman walked
• Transitive verbs: John loves Mary
• Ditransitive verbs: Mary gave Peter flowers
• Intransitive with PP: I rent in Paddington
• Transitive with PP: She put the book on the table
• Sentential complement: I know that she likes you
• Transitive with sentential complement: She told me that Gary is coming on Tuesday
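One simple way to store subcategorization frames is a lexicon that maps each verb to the complement sequences it accepts. The sketch below is illustrative only: the FRAMES table encodes just the one frame per verb shown in the examples above (real verbs usually allow several), and the category labels NP, PP, S and the function name licensed are assumptions.

```python
# Hypothetical lexicon: verb -> list of allowed complement sequences.
FRAMES = {
    "walk": [()],              # intransitive
    "love": [("NP",)],         # transitive
    "give": [("NP", "NP")],    # ditransitive
    "rent": [("PP",)],         # intransitive with PP
    "put":  [("NP", "PP")],    # transitive with PP
    "know": [("S",)],          # sentential complement
    "tell": [("NP", "S")],     # transitive with sentential complement
}

def licensed(verb, complements):
    """True if the verb's lexicon entry allows exactly this complement sequence."""
    return tuple(complements) in FRAMES.get(verb, [])

print(licensed("put", ["NP", "PP"]))   # She put the book on the table
print(licensed("put", ["NP"]))         # *She put the book
```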
Selectional restrictions and preferences
• Subcategorization frames capture syntactic
regularities about complements
• Selectional restrictions and preferences
capture semantic regularities: bark, eat
Phrase structure ambiguity
• Grammars are used for generating and parsing
sentences
• Parses
• Syntactic ambiguity
• Attachment ambiguity: Our company is training
workers.
• The children ate the cake with a spoon.
• High vs. low attachment
• Garden path sentences: The horse raced past the
barn fell. Is the book on the table red?
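The attachment ambiguity in "The children ate the cake with a spoon" can be made concrete by counting parse trees. This is a minimal sketch under stated assumptions: it uses the rules of the sample phrase-structure grammar (with PP → IN NP) plus three lexical entries that are not on the slides (AT → a, NN → spoon, IN → with), and a CKY-style chart; the names count_parses, BINARY, UNARY and LEXICON are illustrative.

```python
from collections import defaultdict

BINARY = [  # rules of the form A -> B C
    ("S", "NP", "VP"), ("NP", "AT", "NNS"), ("NP", "AT", "NN"),
    ("NP", "NP", "PP"), ("VP", "VP", "PP"), ("VP", "VBD", "NP"),
    ("PP", "IN", "NP"),
]
UNARY = [("VP", "VBD")]  # rules of the form A -> B
LEXICON = {  # "a", "spoon", "with" are added for this example
    "the": ["AT"], "a": ["AT"], "children": ["NNS"], "ate": ["VBD"],
    "cake": ["NN"], "spoon": ["NN"], "with": ["IN"],
}

def count_parses(words, goal="S"):
    """Return the number of distinct parse trees rooted in `goal`."""
    n = len(words)
    # chart[(i, j)][A] = number of trees rooted in A that cover words[i:j]
    chart = defaultdict(lambda: defaultdict(int))
    for i, word in enumerate(words):
        for tag in LEXICON.get(word, []):
            chart[(i, i + 1)][tag] += 1
    for span in range(1, n + 1):
        for i in range(n - span + 1):
            j = i + span
            cell = chart[(i, j)]
            for k in range(i + 1, j):          # every split point
                left, right = chart[(i, k)], chart[(k, j)]
                for a, b, c in BINARY:
                    cell[a] += left.get(b, 0) * right.get(c, 0)
            # A single pass suffices because no unary rule feeds another.
            for a, b in UNARY:
                cell[a] += cell.get(b, 0)
    return chart[(0, n)].get(goal, 0)

print(count_parses("the children ate the cake".split()))
print(count_parses("the children ate the cake with a spoon".split()))
```

The longer sentence receives two trees: one where the PP attaches to the VP (high attachment, the spoon is the instrument) and one where it attaches to the object NP (low attachment, the cake has a spoon).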
Ungrammaticality vs. semantic abnormality
* Slept children the.
# Colorless green ideas sleep furiously.
# The cat barked.
Semantics and pragmatics
• Lexical semantics and compositional semantics
• Hypernyms, hyponyms, antonyms, meronyms and
holonyms (part-whole relationship, tire is a
meronym of car), synonyms, homonyms
• Senses of words, polysemous words
• Homophony (bass).
• Collocations: white hair, white wine
• Idioms: to kick the bucket
Discourse analysis
• Anaphoric relations:
1. Mary helped Peter get out of the car. He thanked her.
2. Mary helped the other passenger out of the car.
The man had asked her for help because of his foot
injury.
• Information extraction problems (entity cross-referencing)
Hurricane Hugo destroyed 20,000 Florida homes.
At an estimated cost of one billion dollars, the disaster
has been the most costly in the state’s history.
Pragmatics
• The study of how knowledge about the
world and language conventions interact
with literal meaning.
• Speech acts
• Research issues: resolution of anaphoric
relations, modeling of speech acts in
dialogues
Other areas of NLP
• Linguistics is traditionally divided into phonetics,
phonology, morphology, syntax, semantics, and
pragmatics.
• Sociolinguistics: interactions of social
organization and language.
• Historical linguistics: change over time.
• Linguistic typology
• Language acquisition
• Psycholinguistics: real-time production and
perception of language
Word classes and part-of-speech tagging
Part of speech tagging
• Problems: transport, object, discount, address
• More problems: content
• French: est, président, fils
• “Book that flight”: what is the part of speech associated with “book”?
• POS tagging: assigning parts of speech to words in
a text.
• Three main techniques: rule-based tagging,
stochastic tagging, transformation-based tagging
Rule-based POS tagging
• Use dictionary or FST to find all possible
parts of speech
• Use disambiguation rules (e.g., ART+V)
• Typically hundreds of constraints can be
designed manually
Example in French
Word        Candidate tags         Reading
<S>         ^                      beginning of sentence
La          rf b nms u             article
teneur      nfs nms                noun feminine singular
moyenne     jfs nfs v1s v2s v3s    adjective feminine singular
en          p a b                  preposition
uranium     nms                    noun masculine singular
des         p r                    preposition
rivières    nfp                    noun feminine plural
,           x                      punctuation
bien_que    cs                     subordinating conjunction
délicate    jfs                    adjective feminine singular
à           p                      preposition
calculer    v                      verb
Sample rules
BS3 BI1: A BS3 (3rd person subject personal pronoun) cannot be followed by a
BI1 (1st person indirect personal pronoun). In the example "il nous faut"
(we need), "il" has the tag BS3MS and "nous" has the tags [BD1P BI1P
BJ1P BR1P BS1P]. The negative constraint "BS3 BI1" rules out "BI1P" and
thus leaves only 4 alternatives for the word "nous".
N K: The tag N (noun) cannot be followed by a tag K (interrogative pronoun). An
example in the test corpus would be "... fleuve qui ..." (...river, that...). Since
"qui" can be tagged both as an "E" (relative pronoun) and a "K"
(interrogative pronoun), the "E" will be chosen by the tagger, since an
interrogative pronoun cannot follow a noun ("N").
R V: A word tagged with R (article) cannot be followed by a word tagged with V
(verb), for example "l' appelle" (calls him/her). The word "appelle" can only
be a verb, but "l'" can be either an article or a personal pronoun. Thus, the
rule will eliminate the article tag, giving preference to the pronoun.
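The three negative constraints above can be applied mechanically to prune candidate tags. The sketch below is an illustration under assumptions, not the tagger described in the slides: constraints are matched by tag prefix (so BS3 matches BS3MS), the pronoun reading of "l'" is given the placeholder tag PRON because the slides do not name it, and the function names violates and prune are hypothetical.

```python
# Negative bigram constraints: (left_tag, right_tag) pairs that may not be adjacent.
CONSTRAINTS = {("BS3", "BI1"), ("N", "K"), ("R", "V")}

def violates(left, right):
    """True if a constraint matches the two tags by prefix (assumed matching scheme)."""
    return any(left.startswith(a) and right.startswith(b) for a, b in CONSTRAINTS)

def prune(candidates):
    """candidates: list of (word, set_of_tags). Returns a pruned copy."""
    result = [(word, set(tags)) for word, tags in candidates]
    changed = True
    while changed:
        changed = False
        for (_, left), (_, right) in zip(result, result[1:]):
            # A tag goes when it clashes with every reading still open for its neighbour.
            bad_right = {r for r in right if all(violates(l, r) for l in left)}
            bad_left = {l for l in left if all(violates(l, r) for r in right)}
            for tags, bad in ((right, bad_right), (left, bad_left)):
                if bad and bad != tags:   # never delete a word's last reading
                    tags.difference_update(bad)
                    changed = True
    return result

print(prune([("il", {"BS3MS"}), ("nous", {"BD1P", "BI1P", "BJ1P", "BR1P", "BS1P"})]))
print(prune([("fleuve", {"N"}), ("qui", {"E", "K"})]))
print(prune([("l'", {"R", "PRON"}), ("appelle", {"V"})]))
```

In the three examples, "nous" keeps four tags, "qui" keeps only E, and "l'" keeps only the pronoun reading, matching the outcomes described in the rules.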
Stochastic POS tagging
• HMM tagger
• Pick the most likely tag for this word
• P(word|tag) * P(tag|previous n tags): find the tag
sequence that maximizes this probability formula
• A bigram-based HMM tagger chooses the tag t_i for
word w_i that is most probable given the previous
tag t_{i-1} and the current word w_i:
• t_i = argmax_j P(t_j | t_{i-1}, w_i)
• t_i = argmax_j P(t_j | t_{i-1}) P(w_i | t_j) : HMM equation for a
single tag
Example
• Secretariat/NNP is/VBZ expected/VBN
to/TO race/VB tomorrow/ADV
• People/NNS continue/VBP to/TO
inquire/VB the/DT reason/NN for/IN
the/DT race/NN for/IN outer/JJ space/NN
• P(VB|TO)P(race|VB)
• P(NN|TO)P(race|NN)
• TO: to+VB (to sleep), to+NN (to school)
Example (cont’d)
• P(NN|TO) = .021
• P(VB|TO) = .34
• P(race|NN) = .00041
• P(race|VB) = .00003
• P(VB|TO)P(race|VB) = .0000102 ≈ .00001
• P(NN|TO)P(race|NN) = .0000086 ≈ .000009
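The comparison can be checked by multiplying the four probabilities listed for "race" after the tag TO. A small sketch; the dictionary names are illustrative, the numbers are the ones given in the example.

```python
# Probabilities from the example for the word "race" following the tag TO.
transition = {"VB": 0.34, "NN": 0.021}       # P(tag | TO)
emission = {"VB": 0.00003, "NN": 0.00041}    # P(race | tag)

# Single-tag HMM score: P(tag | previous tag) * P(word | tag)
scores = {tag: transition[tag] * emission[tag] for tag in transition}
best = max(scores, key=scores.get)

for tag, score in scores.items():
    print(f"P({tag}|TO) * P(race|{tag}) = {score:.2e}")
print("chosen tag:", best)
```

The products come out at roughly 1.0e-5 for VB and 8.6e-6 for NN, so the verb reading wins even though "race" on its own is far more often a noun.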
HMM Tagging
• T = argmax_T P(T|W), where T = t_1, t_2, …, t_n
• By Bayes’ rule: P(T|W) = P(T)P(W|T)/P(W)
• Thus we are attempting to choose the sequence of
tags that maximizes the right-hand side of the equation
• P(W) can be ignored, since it is the same for every tag sequence
• P(T)P(W|T) =
∏_{i=1}^{n} P(w_i | w_1 t_1 … w_{i-1} t_{i-1} t_i) P(t_i | w_1 t_1 … w_{i-1} t_{i-1})
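Under the bigram simplification used earlier (each tag depends only on the previous tag, each word only on its own tag), the maximizing tag sequence can be found with the Viterbi algorithm. The following is a minimal sketch rather than the lecture's implementation: the table layout, the start symbol "<s>", and the two probabilities of 1.0 for "to" are assumptions made for the toy example; the remaining numbers are the "race" probabilities quoted above.

```python
import math

def viterbi(words, tags, trans, emit, start="<s>"):
    """Most probable tag sequence under a bigram HMM.

    trans[(prev, tag)] = P(tag | prev); emit[(tag, word)] = P(word | tag).
    Missing entries count as probability 0.
    """
    if not words:
        return []

    def logp(table, key):
        p = table.get(key, 0.0)
        return math.log(p) if p > 0 else float("-inf")

    # best[i][t] = (log-probability of the best path ending in t at i, backpointer)
    best = [{t: (logp(trans, (start, t)) + logp(emit, (t, words[0])), None)
             for t in tags}]
    for i in range(1, len(words)):
        column = {}
        for t in tags:
            score, prev = max((best[i - 1][p][0] + logp(trans, (p, t)), p)
                              for p in tags)
            column[t] = (score + logp(emit, (t, words[i])), prev)
        best.append(column)

    # Follow backpointers from the best final tag.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for i in range(len(words) - 1, 0, -1):
        tag = best[i][tag][1]
        path.append(tag)
    return path[::-1]

TAGS = ["TO", "VB", "NN"]
TRANS = {("<s>", "TO"): 1.0, ("TO", "VB"): 0.34, ("TO", "NN"): 0.021}
EMIT = {("TO", "to"): 1.0, ("VB", "race"): 0.00003, ("NN", "race"): 0.00041}
print(viterbi(["to", "race"], TAGS, TRANS, EMIT))
```

Working in log space avoids numerical underflow on long sentences; for the two-word input the result is the sequence TO VB, in line with the single-tag comparison.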
Transformation-based learning
• P(NN|race) = .98
• P(VB|race) = .02
• Change NN to VB when the previous tag is TO
• Types of rules:
– The preceding (following) word is tagged z
– The word two before (after) is tagged z
– One of the two preceding (following) words is tagged z
– One of the three preceding (following) words is tagged z
– The preceding word is tagged z and the following word is tagged w
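The two stages of a transformation-based tagger (label each word with its most likely tag, then apply the transformations in order) can be sketched as follows. This shows only rule application, not the learning loop that selects rules; the MOST_LIKELY table, the NN default for unknown words, and the helper names are assumptions for illustration.

```python
# Stage 1: context-free initial tagging with each word's most likely tag.
# "race" gets NN because P(NN|race) = .98; the other entries are illustrative.
MOST_LIKELY = {
    "secretariat": "NNP", "is": "VBZ", "expected": "VBN",
    "to": "TO", "race": "NN", "tomorrow": "ADV", "the": "DT",
}

def initial_tags(words):
    # Unknown words default to NN (an assumption of this sketch).
    return [MOST_LIKELY.get(w.lower(), "NN") for w in words]

# Stage 2: transformations of the form (from_tag, to_tag, trigger).
def prev_tag_is(z):
    """Trigger for the rule type 'the preceding word is tagged z'."""
    return lambda tags, i: i > 0 and tags[i - 1] == z

RULES = [("NN", "VB", prev_tag_is("TO"))]   # change NN to VB after TO

def apply_rules(tags, rules=RULES):
    tags = list(tags)
    for from_tag, to_tag, trigger in rules:
        for i in range(len(tags)):
            if tags[i] == from_tag and trigger(tags, i):
                tags[i] = to_tag
    return tags

words = "Secretariat is expected to race tomorrow".split()
print(list(zip(words, apply_rules(initial_tags(words)))))
print(apply_rules(initial_tags("the race".split())))
```

After "to", the tag of "race" is rewritten to VB, while in "the race" it stays NN because the trigger does not fire.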
Confusion matrix
       IN    JJ    NN    NNP   RB    VBD   VBN
IN     -     .2                .7
JJ     .2    -     3.3   2.1   1.7   .2    2.7
NN           8.7   -                       .2
NNP    .2    3.3   4.1   -     .2
RB     2.2   2.0   .5          -
VBD          .3    .5                -     4.4
VBN          2.8                     2.6   -
Most confusing: NN vs. NNP vs. JJ, VBD vs. VBN vs. JJ
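A confusion matrix of this kind can be computed from a gold-standard tagging and a tagger's output. The sketch below assumes that rows are correct tags, columns are predicted tags, and each cell is a percentage of all tagging errors; the slide does not state this, so treat the convention, the function name, and the toy data as assumptions.

```python
from collections import Counter

def confusion_percentages(gold, predicted):
    """Map (correct_tag, predicted_tag) -> percentage of all tagging errors."""
    errors = Counter((g, p) for g, p in zip(gold, predicted) if g != p)
    total = sum(errors.values())
    if total == 0:
        return {}
    return {pair: 100.0 * count / total for pair, count in errors.items()}

# Toy data, invented for illustration only.
gold = ["NN", "NN", "JJ", "VBD", "NN"]
pred = ["JJ", "JJ", "NN", "VBN", "NN"]
for (correct, guessed), pct in sorted(confusion_percentages(gold, pred).items()):
    print(f"{correct} tagged as {guessed}: {pct:.1f}% of errors")
```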
Readings
• J&M Chapters 1, 2, 3, 8
• “What is Computational Linguistics” by
Hans Uszkoreit
http://www.coli.uni-sb.de/~hansu/what_is_cl.html
• Lecture notes #1
Readings
• J&M Chapters 3, 8
• Lecture notes #2