Natural Language Processing
Artificial Intelligence
CMSC 25000
February 28, 2002
Agenda
• Why NLP?
– Goals & Applications
• Challenges: Knowledge & Ambiguity
– Key types of knowledge
• Morphology, Syntax, Semantics, Pragmatics, Discourse
– Handling Ambiguity
• Syntactic Ambiguity: Probabilistic Parsing
• Semantic Ambiguity: Word Sense Disambiguation
• Conclusions
Why Language?
• Natural Language in Artificial Intelligence
– Language use as distinctive feature of human
intelligence
– Infinite utterances:
• Diverse languages with fundamental similarities
• “Computational linguistics”
– Communicative acts
• Inform, request,...
Why Language? Applications
• Machine Translation
• Question-Answering
– Database queries to web search
• Spoken language systems
• Intelligent tutoring
Knowledge of Language
• What does it mean to know a language?
– Know the words (lexicon)
• Pronunciation, Formation, Conjugation
– Know how the words form sentences
• Sentence structure, Compositional meaning
– Know how to interpret the sentence
• Statement, question,..
– Know how to group sentences
• Narrative coherence, dialogue
Word-level Knowledge
• Lexicon:
– List of legal words in a language
– Part of speech:
• noun, verb, adjective, determiner
• Example:
– Noun -> cat | dog | mouse | ball | rock
– Verb -> chase | bite | fetch | bat
– Adjective -> black | brown | furry | striped | heavy
– Determiner -> the | that | a | an
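The lexicon above can be sketched as a simple lookup table; this is an illustrative sketch (names like `LEXICON` and `parts_of_speech` are not from the slides), showing how one word may carry several parts of speech:

```python
# Minimal lexicon sketch: each word maps to its possible parts of speech.
# Entries follow the slides' example grammar; "dog" and "rock" are
# deliberately ambiguous (noun or verb), as noted later in the lecture.
LEXICON = {
    "cat": {"Noun"}, "dog": {"Noun", "Verb"}, "mouse": {"Noun"},
    "ball": {"Noun"}, "rock": {"Noun", "Verb"},
    "chase": {"Verb"}, "bite": {"Verb"}, "fetch": {"Verb"},
    "bat": {"Noun", "Verb"},
    "black": {"Adjective"}, "brown": {"Adjective"}, "furry": {"Adjective"},
    "the": {"Determiner"}, "that": {"Determiner"},
    "a": {"Determiner"}, "an": {"Determiner"},
}

def parts_of_speech(word):
    """Look up the possible parts of speech for a word (empty set if unknown)."""
    return LEXICON.get(word.lower(), set())
```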
Word-level Knowledge: Issues
• Issue 1: Lexicon Size
– Potentially HUGE!
– Controlling factor: morphology
• Store base forms (roots/stems)
– Use morphological processes to generate / analyze
– E.g. dog: dog(s); sing: sings, sang, sung, singing, singer, ...
• Issue 2: Lexical ambiguity
– rock: N/V; dog: N/V;
– “Time flies like a banana”
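The base-form idea can be sketched as a toy analyzer that strips regular suffixes to recover a stored root; this is an assumption-laden sketch (the `ROOTS`/`SUFFIXES` tables are invented for illustration), and irregular forms like sing/sang would need explicit tables:

```python
# Toy morphological analyzer sketch: instead of storing every inflected
# form, store base forms and strip common English suffixes.
# Irregular morphology (sing -> sang) is NOT handled here.
ROOTS = {"dog", "cat", "sing", "chase"}
SUFFIXES = ["ing", "er", "s"]  # checked longest-first

def analyze(word):
    """Return (root, suffix) if word is a known root plus a regular suffix."""
    if word in ROOTS:
        return (word, "")
    for suf in SUFFIXES:
        stem = word[: -len(suf)]
        if word.endswith(suf) and stem in ROOTS:
            return (stem, suf)
    return None  # unknown or irregular form
```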
Sentence-level Knowledge: Syntax
• Language models
– More than just words: “banana a flies time like”
– Formal vs natural: Grammar defines language
Chomsky Hierarchy:
– Recursively Enumerable: any rule
– Context Sensitive: AB -> BA; e.g. a^n b^n c^n
– Context Free: A -> aBc; e.g. a^n b^n
– Regular Expression: S -> aS; e.g. a*b*
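The difference between the regular language a*b* and the context-free language a^n b^n can be sketched with two small recognizers (an illustrative sketch; the function names are invented):

```python
import re

def is_regular_lang(s):
    """a*b*: any a's followed by any b's -- recognizable by a regular expression."""
    return re.fullmatch(r"a*b*", s) is not None

def is_anbn(s):
    """a^n b^n: EQUAL counts of a's then b's -- context-free but not regular,
    since a finite automaton cannot count unboundedly."""
    n = len(s) // 2
    return len(s) % 2 == 0 and s == "a" * n + "b" * n
```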
Syntactic Analysis: Grammars
• Natural vs Formal languages
– Natural languages have degrees of acceptability
• ‘It ain’t hard’; ‘You gave what to whom?’
• Grammar combines words into phrases
– S-> NP VP
– NP -> {Det} {Adj} N
– VP -> V | V NP | V NP PP
Syntactic Analysis: Parsing
• Recover phrase structure from sentence
– Based on grammar
– E.g. “The black cat chased the furry mouse”:
[S [NP [Det The] [Adj black] [N cat]]
   [VP [V chased] [NP [Det the] [Adj furry] [N mouse]]]]
Syntactic Analysis: Parsing
• Issue 1: Complexity
• Solution 1: Chart parser - dynamic programming
– O(Gn^2)
• Issue 2: Structural ambiguity
– ‘I saw the man on the hill with the telescope’
• Is the telescope on the hill?’
• Solution 2 (partial): Probabilistic parsing
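The chart-parsing idea can be sketched as a CKY recognizer over a toy grammar. This is a simplification, not the slides' grammar: CKY assumes Chomsky Normal Form (binary rules plus lexical rules), and the `LEX`/`BIN` tables here are invented for illustration.

```python
from itertools import product

# Toy CNF grammar: lexical rules (word -> preterminal) and binary rules.
LEX = {"the": "Det", "cat": "N", "mouse": "N", "chased": "V"}
BIN = {("Det", "N"): "NP", ("V", "NP"): "VP", ("NP", "VP"): "S"}

def cky(words):
    """CKY chart parsing: dynamic programming over spans.
    chart[i][j] holds the nonterminals covering words[i:j]."""
    n = len(words)
    chart = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(words):
        chart[i][i + 1].add(LEX[w])
    for span in range(2, n + 1):          # widen spans bottom-up
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):     # try every split point
                for a, b in product(chart[i][k], chart[k][j]):
                    if (a, b) in BIN:
                        chart[i][j].add(BIN[(a, b)])
    return chart[0][n]  # categories spanning the whole sentence
```

Each chart cell is filled once and reused, which is what makes the parser polynomial rather than exponential in sentence length.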
Semantic Analysis
• Grammatical ≠ Meaningful
– “Colorless green ideas sleep furiously”
• Compositional Semantics
– Meaning of a sentence is meaning of subparts
– Associate semantic interpretation with syntactic structure
– E.g. Nouns are variables (themselves): cat, mouse
• Adjectives: unary predicates: Black(cat), Furry(mouse)
• Verbs: multi-place predicates: VP: λx. chased(x, Furry(mouse))
• Sentence: (λx. chased(x, Furry(mouse)))(Black(cat))
– = chased(Black(cat), Furry(mouse))
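The compositional step above can be sketched directly with Python lambdas, building the sentence meaning by function application (an illustrative sketch; meanings are encoded as nested tuples, a representation chosen here, not one from the slides):

```python
# Compositional semantics sketch: constituents map to lambda terms,
# and the sentence meaning falls out of function application.
cat, mouse = "cat", "mouse"
Black = lambda x: ("Black", x)   # adjective: unary predicate
Furry = lambda x: ("Furry", x)

# VP "chased the furry mouse": a one-place predicate awaiting its subject.
vp = lambda x: ("chased", x, Furry(mouse))

# Sentence: apply the VP to the subject NP "the black cat".
meaning = vp(Black(cat))
# meaning == ("chased", ("Black", "cat"), ("Furry", "mouse"))
```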
Semantic Ambiguity
• Examples:
– I went to the bank
• of the river
• to deposit some money
– He banked
• at First Union
• the plane
• Interpretation depends on
– Sentence (or larger) topic context
– Syntactic structure
Pragmatics & Discourse
• Interpretation in context
– Act accomplished by utterance
• “Do you have the time?”, “Can you pass the salt?”
• Requests with non-literal meaning
– Also includes politeness, performatives, etc.
• Interpretation of multiple utterances
– “The cat chased the mouse. It got away.”
– Resolve referring expressions
Natural Language Understanding
Input → Tokenization/Morphology → Parsing → Semantic Analysis → Pragmatics/Discourse → Meaning
• Key issues:
– Knowledge
• How acquire this knowledge of language?
– Hand-coded? Automatically acquired?
– Ambiguity
• How determine appropriate interpretation?
– Pervasive, preference-based
Handling Syntactic Ambiguity
• Natural language syntax
• Varied, has DEGREES of acceptability
• Ambiguous
• Probability: framework for preferences
– Augment original context-free rules: PCFG
– Add probabilities to transitions
S  -> NP VP        0.85
S  -> S conj S     0.15
NP -> N            0.2
NP -> Det N        0.65
NP -> Det Adj N    0.10
NP -> NP PP        0.05
VP -> V            0.45
VP -> V NP         0.45
VP -> V NP PP      0.10
PP -> P NP         1.0
PCFGs
• Learning probabilities
– Strategy 1: Write (manual) CFG,
• Use treebank (collection of parse trees) to find probabilities
– Strategy 2: Use larger treebank (+ linguistic constraint)
• Learn rules & probabilities (inside-outside algorithm)
• Parsing with PCFGs
– Rank parse trees based on probability
– Provides graceful degradation
• Can get some parse even for unusual constructions, with a low probability
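Strategy 1 (read probabilities off a treebank) can be sketched as maximum-likelihood counting: each rule's probability is its count divided by the count of its left-hand side. This is an illustrative sketch; the toy rule counts below are invented, not from a real treebank.

```python
from collections import Counter

def estimate_pcfg(treebank_rules):
    """Maximum-likelihood estimate of PCFG rule probabilities:
    P(A -> beta) = count(A -> beta) / count(A)."""
    rule_counts = Counter(treebank_rules)                 # (lhs, rhs) pairs
    lhs_counts = Counter(lhs for lhs, _ in treebank_rules)
    return {(lhs, rhs): c / lhs_counts[lhs]
            for (lhs, rhs), c in rule_counts.items()}

# Toy "treebank": rules read off a handful of parse trees (illustrative).
rules = [("NP", "Det N")] * 3 + [("NP", "N")] * 1 + [("S", "NP VP")] * 4
probs = estimate_pcfg(rules)
# probs[("NP", "Det N")] == 0.75, probs[("S", "NP VP")] == 1.0
```

Probabilities for each left-hand side sum to 1, so the grammar defines a proper distribution over its rewrites.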
Parse Ambiguity
• Two parse trees for “I saw the man with the telescope”
– T1, PP attached to the VP (the seeing is done with the telescope):
[S [NP [N I]] [VP [V saw] [NP [Det the] [N man]] [PP [P with] [NP [Det the] [N telescope]]]]]
– T2, PP attached to the NP (the man has the telescope):
[S [NP [N I]] [VP [V saw] [NP [NP [Det the] [N man]] [PP [P with] [NP [Det the] [N telescope]]]]]]
Parse Probabilities
P(T, S) = ∏_{n ∈ T} p(r(n))
– T(ree),S(entence),n(ode),R(ule)
– T1 = 0.85*0.2*0.1*0.65*1*0.65 = 0.007
– T2 = 0.85*0.2*0.45*0.05*0.65*1*0.65 ≈ 0.002
• Select T1
• Best systems achieve 92-93% accuracy
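The two products can be checked directly by multiplying the rule probabilities from the PCFG slide (a worked verification; the dictionary keys are just string labels chosen here for readability):

```python
from math import prod

# Rule probabilities from the lecture's PCFG (telescope example).
p = {"S->NP VP": 0.85, "NP->N": 0.2, "NP->Det N": 0.65, "NP->NP PP": 0.05,
     "VP->V NP": 0.45, "VP->V NP PP": 0.10, "PP->P NP": 1.0}

# T1: PP attaches to the VP (saw ... with the telescope).
t1 = prod(p[r] for r in
          ["S->NP VP", "NP->N", "VP->V NP PP",
           "NP->Det N", "PP->P NP", "NP->Det N"])

# T2: PP attaches to the NP (the man with the telescope).
t2 = prod(p[r] for r in
          ["S->NP VP", "NP->N", "VP->V NP", "NP->NP PP",
           "NP->Det N", "PP->P NP", "NP->Det N"])

# t1 ≈ 0.0072 > t2 ≈ 0.0016, so the parser prefers T1.
```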
Semantic Ambiguity
• “Plant” ambiguity
– Botanical vs Manufacturing senses
• Two types of context
– Local: 1-2 words away
– Global: several sentence window
• Two observations (Yarowsky 1995)
– One sense per collocation (local)
– One sense per discourse (global)
Learn Disambiguators
• Initialize small set of “seed” cases
• Collect local context information
– “collocations”
• E.g. 2 words away from “production”, 1 word from “seed”
• Contexts = rules
• Make decision list = rules ranked by mutual information
• Iterate: label via decision list, collect contexts
• Label all entries in discourse with majority sense
– Repeat
Disambiguate
• For each new unlabeled case,
– Use decision list to label
• > 95% accurate on a set of highly ambiguous words
– Also used for accent restoration in e-mail
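The labeling step can be sketched as a ranked decision list: the highest-scoring rule whose collocational feature appears in the context decides the sense. This is a simplified sketch in the spirit of Yarowsky (1995); the features, senses, and scores below are invented for illustration, not learned from data.

```python
# Decision-list disambiguation sketch for the ambiguous word "plant".
# Rules are (collocational feature, sense, score); scores would normally
# come from mutual information / log-likelihood ratios over training data.
DECISION_LIST = [
    ("manufacturing", "plant/factory", 9.1),
    ("assembly",      "plant/factory", 8.0),
    ("flower",        "plant/living",  7.6),
    ("seed",          "plant/living",  6.2),
]

def disambiguate(context_words, default="plant/living"):
    """Label an occurrence of 'plant' using the highest-ranked matching rule."""
    context = set(context_words)
    for feature, sense, _score in sorted(DECISION_LIST, key=lambda r: -r[2]):
        if feature in context:
            return sense
    return default  # no rule fired: back off to a default sense
```

In the full algorithm the same list also labels unlabeled training data, and the one-sense-per-discourse observation propagates the majority label across a document.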
Natural Language Processing
• Goals: Understand and imitate distinctive
human capacity
• Myriad applications: MT, Q&A, SLS
• Key Issues:
– Capturing knowledge of language
• Automatic acquisition current focus: linguistics+ML
– Resolving ambiguity, managing preference
• Apply (probabilistic) knowledge
• Effective in constrained environment