Dragon Star Program Course: Information Retrieval
Course Overview & Background
ChengXiang Zhai (翟成祥)
Department of Computer Science
Graduate School of Library & Information Science
Institute for Genomic Biology, Statistics
University of Illinois, Urbana-Champaign
http://www-faculty.cs.uiuc.edu/~czhai, [email protected]
2008 © ChengXiang Zhai
Dragon Star Lecture at Beijing University, June 21-30, 2008
Outline
• Course overview
• Essential background
– Probability & statistics
– Basic concepts in information theory
– Natural language processing
Course Overview
Course Objectives
• Introduce the field of information retrieval (IR)
– Foundation: Basic concepts, principles, methods,
etc
– Trends: Frontier topics
• Prepare students to do research in IR and/or
related fields
– Research methodology (general and IR-specific)
– Research proposal writing
– Research project (to be finished after the lecture
period)
Prerequisites
• Proficiency in programming (C++ is needed for
assignments)
• Knowledge of basic probability & statistics (would be necessary for understanding algorithms deeply)
• Big plus: knowledge of related areas
– Machine learning
– Natural language processing
– Data mining
–…
Course Management
• Teaching staff
– Instructor: ChengXiang Zhai (UIUC)
– Teaching assistants:
• Hongfei Yan (Peking Univ)
• Bo Peng (Peking Univ)
• Course website: http://net.pku.edu.cn/~course/cs410/
• Course group discussion: http://groups.google.com/group/cs410pku
• Questions: First post the questions on the group discussion forum; if questions are unanswered, bring them to the office hours (first office hour: June 23, 2:30-4:30pm)
Format & Requirements
• Lecture-based:
– Morning lectures: Foundation & Trends
– Afternoon lectures: IR research methodology
– Readings are usually available online
• 2 Assignments (based on morning lectures)
– Coding (C++), experimenting with data, analyzing
results, open explorations (~5 hours each)
• Final exam (based on morning lectures): 1:30-4:30pm, June 30.
– Practice questions will be available
Format & Requirements (cont.)
• Course project (Mini-TREC)
– Work in teams
– Phase I: create test collections (~3 hours, done within lecture period)
– Phase II: develop algorithms and submit results (done in the summer)
• Research project proposal (based on afternoon lectures)
– Work in teams
– 2-page outline done within lecture period
– Full proposal (5 pages) due later
Coverage of Topics: IR vs. TIM
IR and TIM will be used interchangeably
[Figure: diagram relating Text Information Management (TIM), Information Retrieval (IR), and multimedia, etc.]
What is Text Info. Management?
• TIM is concerned with technologies for managing
and exploiting text information effectively and
efficiently
• Importance of managing text information
– The most natural way of encoding knowledge
• Think about scientific literature
– The most common type of information
• How much textual information do you produce and
consume every day?
– The most basic form of information
• It can be used to describe other media of information
– The most useful form of information!
Text Management Applications
[Figure: three kinds of text management applications. Access: select information. Mining: create knowledge. Organization: add structure/annotations.]
Examples of Text Management Applications
• Search
– Web search engines (Google, Yahoo, …)
– Library systems
– …
• Recommendation
– News filter
– Literature/movie recommender
• Categorization
– Automatically sorting emails
– …
• Mining/Extraction
– Discovering major complaints from email in customer service
– Business intelligence
– Bioinformatics
– …
• Many others…
Elements of Text Info Management Technologies
[Figure: layered diagram built on Text and Natural Language Content Analysis. Labels: Information Access (Search, Filtering), Information Organization (Categorization, Clustering), Knowledge Acquisition (Extraction, Mining), Summarization, Visualization, Retrieval Applications, Mining Applications. One region is marked “Focus of the course”.]
Text Management and Other Areas
[Figure: TM Applications and TM Algorithms shown between User and Text, with related areas: human-computer interaction, software engineering, Web, probabilistic inference, machine learning, natural language processing, storage, compression, information science, computer science.]
Related Areas
[Figure: Information Retrieval surrounded by related areas, along three directions labeled Applications, Models, and Systems. Labels: Statistics, Optimization, Machine Learning, Pattern Recognition, Data Mining, Natural Language Processing, Web, Bioinformatics…, Library & Info Science, Databases, Software engineering, Computer systems, Algorithms.]
Publications/Societies (Incomplete)
[Figure: venues grouped by area]
• Learning/Mining: ICML, NIPS, UAI, AAAI, ACM SIGKDD
• NLP: ACL, HLT, COLING, EMNLP, ANLP
• Applications: ISMB, RECOMB, PSB, WWW
• Info Retrieval: ACM SIGIR, ACM CIKM, TREC
• Info. Science: JCDL, ASIS
• Databases: ACM SIGMOD, VLDB, PODS, ICDE
• Software/systems: SOSP, OSDI
• Statistics
Schedule: available at
http://net.pku.edu.cn/~course/cs410/
Columns: Date; Morning Lecture (8:30-11:30, Foundation & Trends); Afternoon Lecture (1:30-2:30, Research Methodology); Notes
• 6/21 Sat
– Morning: Course overview and background (probability, statistics, information theory, NLP). Slides: ppt. Lecture Notes: Prob & Stat, Info Theory, NLP. Readings: Bush 45, Rosenfeld's note on estimation, Rosenfeld's note on information theory
– Afternoon: Introduction to IR research. Slides: ppt
• 6/22 Sun
– Morning: Information Retrieval Overview (Part 1) (basic concepts, history, evaluation). Lecture Notes: text retrieval. Readings: Singhal's review (Error), Book-Ch8, TREC measures
– Afternoon: Prepare yourself for IR research
– Notes: Mini-TREC task specification ready
• 6/23 Mon
– Morning: Information Retrieval Overview (Part 2) (basic retrieval models, system implementation, applications)
– Afternoon: Find a good IR research topic
– Notes: Assign #1 out
• 6/24 Tue
– Morning: Statistical Language Models for IR (probabilistic retrieval models, KL-divergence model, special retrieval tasks)
– Afternoon: Formulate IR research hypotheses
– Notes: Assign #2 out
• 6/25 Wed
– Morning: Modern Retrieval Frameworks (axiomatic, decision-theoretic)
• 6/26 Thu
– Morning: Personalized Search & User Modeling (implicit feedback, explicit feedback, active feedback)
– Afternoon: Test/Refine IR research hypotheses
• 6/27 Fri
– Morning: Natural Language Processing for IR (phrase indexing, dependency analysis, sense disambiguation, sentiment retrieval)
– Afternoon: Write and publish an IR paper
• 6/28 Sat
– No class
– Notes: Mini-TREC Phase I Task due
• 6/29 Sun
– Morning: Topic Models for Text Mining (PLSA, LDA, extensions and applications)
– Notes: Proposal outline due
• 6/30 Mon
– Morning: Future of IR, course summary
– Afternoon: Final Exam (1:30-4:30)
• 7/5 Sat: Research proposal due
• 7/?: Mini-TREC data sets ready
• 8/?: Mini-TREC Phase II Task due
• Other notes (date not shown here): Proposal team due; Assigns #1, #2 due; Final exam practice questions available
Essential Background 1:
Probability & Statistics
Prob/Statistics & Text Management
• Probability & statistics provide a principled way
to quantify the uncertainties associated with
natural language
• Allow us to answer questions like:
– Given that we observe “baseball” three times and “game” once
in a news article, how likely is it about “sports”?
(text categorization, information retrieval)
– Given that a user is interested in sports news, how likely would
the user use “baseball” in a query?
(information retrieval)
Basic Concepts in Probability
• Random experiment: an experiment with uncertain outcome (e.g., tossing a coin, picking a word from text)
• Sample space: all possible outcomes, e.g.,
– Tossing 2 fair coins, $S = \{HH, HT, TH, TT\}$
• Event: $E \subseteq S$, E happens iff outcome is in E, e.g.,
– $E = \{HH\}$ (all heads)
– $E = \{HH, TT\}$ (same face)
– Impossible event ($\{\}$), certain event ($S$)
• Probability of event: $0 \le P(E) \le 1$, s.t.
– $P(S) = 1$ (outcome always in S)
– $P(A \cup B) = P(A) + P(B)$ if $A \cap B = \emptyset$ (e.g., A = same face, B = different face)
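The two-coin example can be checked by enumerating the sample space directly. The sketch below is only an illustration, assuming equally likely outcomes; the helper name `prob` is made up for this example.

```python
from fractions import Fraction
from itertools import product

# Sample space for tossing 2 fair coins: HH, HT, TH, TT (equally likely).
S = {"".join(faces) for faces in product("HT", repeat=2)}

def prob(event):
    """P(E) for an event E, i.e. a subset of S, under equally likely outcomes."""
    assert event <= S, "an event must be a subset of the sample space"
    return Fraction(len(event), len(S))

all_heads = {"HH"}
same_face = {"HH", "TT"}
different_face = S - same_face  # disjoint from same_face

print(prob(all_heads))
print(prob(same_face | different_face))  # union of two disjoint events
```

Because `same_face` and `different_face` are disjoint and together cover S, their probabilities add up to P(S) = 1.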
Basic Concepts of Prob. (cont.)
• Conditional probability: $P(B \mid A) = P(A \cap B)/P(A)$
– $P(A \cap B) = P(A)P(B \mid A) = P(B)P(A \mid B)$
– So, $P(A \mid B) = P(B \mid A)P(A)/P(B)$ (Bayes’ Rule)
– For independent events, $P(A \cap B) = P(A)P(B)$, so $P(A \mid B) = P(A)$
• Total probability: If $A_1, \dots, A_n$ form a partition of S, then
– $P(B) = P(B \cap S) = P(B \cap A_1) + \dots + P(B \cap A_n)$ (why?)
– So, $P(A_i \mid B) = P(B \mid A_i)P(A_i)/P(B) = P(B \mid A_i)P(A_i)/[P(B \mid A_1)P(A_1) + \dots + P(B \mid A_n)P(A_n)]$
– This allows us to compute $P(A_i \mid B)$ based on $P(B \mid A_i)$
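A minimal sketch of how Bayes' rule and total probability combine in practice. The function name `posterior` and all the numbers are assumptions made up for illustration; they are not from the lecture.

```python
def posterior(priors, likelihoods):
    """Return P(A_i | B) for every A_i in a partition of S.

    priors[a] is P(A_i); likelihoods[a] is P(B | A_i).
    """
    joint = {a: priors[a] * likelihoods[a] for a in priors}  # P(B and A_i)
    p_b = sum(joint.values())                                # total probability P(B)
    return {a: joint[a] / p_b for a in joint}

# Hypothetical numbers: topic priors and P("baseball" occurs | topic).
priors = {"sport": 0.2, "computer": 0.3, "other": 0.5}
likelihoods = {"sport": 0.5, "computer": 0.01, "other": 0.02}

post = posterior(priors, likelihoods)
print(post)
```

Even with a small prior, "sport" ends up with most of the posterior mass here, because its likelihood is much larger than that of the other topics.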
Interpretation of Bayes’ Rule
Hypothesis space: $H = \{H_1, \dots, H_n\}$
Evidence: E
$$P(H_i \mid E) = \frac{P(E \mid H_i)\,P(H_i)}{P(E)}$$
– $P(H_i \mid E)$: posterior probability of $H_i$
– $P(H_i)$: prior probability of $H_i$
– $P(E \mid H_i)$: likelihood of data/evidence if $H_i$ is true
If we want to pick the most likely hypothesis H*, we can drop P(E):
$$P(H_i \mid E) \propto P(E \mid H_i)\,P(H_i)$$
Random Variable
• $X: S \to \mathbb{R}$ (“measure” of outcome)
– E.g., number of heads, all same face?, …
• Events can be defined according to X
– $E(X=a) = \{s_i \mid X(s_i) = a\}$
– $E(X \ge a) = \{s_i \mid X(s_i) \ge a\}$
• So, probabilities can be defined on X
– $P(X=a) = P(E(X=a))$
– $P(a \le X) = P(E(a \le X))$
• Discrete vs. continuous random variable
(think of “partitioning the sample space”)
An Example: Doc Classification
Sample space $S = \{x_1, \dots, x_n\}$. For 3 topics, four words, n = ?

      Topic      the  computer  game  baseball
X1: [ sport      1    0         1     1 ]
X2: [ sport      1    1         1     1 ]
X3: [ computer   1    1         0     0 ]
X4: [ computer   1    1         1     0 ]
X5: [ other      0    0         1     1 ]
……

Events:
– $E_{sport} = \{x_i \mid topic(x_i) = \text{“sport”}\}$
– $E_{baseball} = \{x_i \mid baseball(x_i) = 1\}$
– $E_{baseball, \neg computer} = \{x_i \mid baseball(x_i) = 1 \;\&\; computer(x_i) = 0\}$
Conditional probabilities:
$P(E_{sport} \mid E_{baseball})$, $P(E_{baseball} \mid E_{sport})$, $P(E_{sport} \mid E_{baseball, \neg computer})$, ...
Thinking in terms of random variables:
Topic: $T \in \{\text{“sport”}, \text{“computer”}, \text{“other”}\}$, “Baseball”: $B \in \{0, 1\}$, …
$P(T=\text{“sport”} \mid B=1)$, $P(B=1 \mid T=\text{“sport”})$, ...
An inference problem:
Suppose we observe that “baseball” is mentioned. How likely is it that the topic is “sport”?
$$P(T=\text{“sport”} \mid B=1) \propto P(B=1 \mid T=\text{“sport”})\,P(T=\text{“sport”})$$
But, $P(B=1 \mid T=\text{“sport”}) = ?$, $P(T=\text{“sport”}) = ?$
Getting to Statistics ...
• P(B=1|T=“sport”)=? (parameter estimation)
– If we see the results of a huge number of random
experiments, then
$$\hat{P}(B=1 \mid T=\text{“sport”}) = \frac{count(B=1, T=\text{“sport”})}{count(T=\text{“sport”})}$$
– But, what if we only see a small sample (e.g., 2)? Is this
estimate still reliable?
• In general, statistics has to do with drawing
conclusions on the whole population based on
observations of a sample (data)
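The count-based estimate can be tried on the five example documents from the doc classification slide. This is a sketch only: those five rows are a partial sample, and the function name `estimate` is an assumption made for this illustration.

```python
# The five example documents: (topic, word-occurrence indicators).
docs = [
    ("sport",    {"the": 1, "computer": 0, "game": 1, "baseball": 1}),
    ("sport",    {"the": 1, "computer": 1, "game": 1, "baseball": 1}),
    ("computer", {"the": 1, "computer": 1, "game": 0, "baseball": 0}),
    ("computer", {"the": 1, "computer": 1, "game": 1, "baseball": 0}),
    ("other",    {"the": 0, "computer": 0, "game": 1, "baseball": 1}),
]

def estimate(word, topic, sample):
    """Relative-frequency estimate of P(word = 1 | T = topic)."""
    in_topic = [words for t, words in sample if t == topic]
    if not in_topic:
        raise ValueError("no documents with topic %r in the sample" % topic)
    return sum(words[word] for words in in_topic) / len(in_topic)

print(estimate("baseball", "sport", docs))
print(estimate("game", "computer", docs))
```

With only two "sport" documents, the estimate for "baseball" comes out as exactly 1, which shows why an estimate from a small sample may not be reliable.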
Parameter Estimation
• General setting:
– Given a (hypothesized & probabilistic) model that
governs the random experiment
– The model gives a probability of any data $p(D \mid \theta)$ that depends on the parameter $\theta$
– Now, given actual sample data X={x1,…,xn}, what can we say about the value of $\theta$?
• Intuitively, take your best guess of $\theta$ -- “best” means “best explaining/fitting the data”
• Generally an optimization problem
Maximum Likelihood vs. Bayesian
• Maximum likelihood estimation
– “Best” means “data likelihood reaches maximum”
$$\hat{\theta} = \arg\max_{\theta} P(X \mid \theta)$$
– Problem: small sample
• Bayesian estimation
– “Best” means being consistent with our “prior” knowledge and explaining data well
$$\hat{\theta} = \arg\max_{\theta} P(\theta \mid X) = \arg\max_{\theta} P(X \mid \theta)\,P(\theta)$$
– Problem: how to define prior?
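To make the contrast concrete, here is a sketch for a coin with unknown P(Head). The Beta(a, b) prior is an assumption chosen for illustration (the slides do not fix a prior); with it, the posterior mode has the closed form used in `map_estimate`.

```python
def ml_estimate(heads, tosses):
    """Maximum likelihood estimate of P(Head)."""
    return heads / tosses

def map_estimate(heads, tosses, a=2.0, b=2.0):
    """Posterior mode of P(Head) under an assumed Beta(a, b) prior, a > 1 and b > 1."""
    return (heads + a - 1) / (tosses + a + b - 2)

# Small sample: 2 heads in 2 tosses.
print(ml_estimate(2, 2), map_estimate(2, 2))
# Large sample: the prior matters much less.
print(ml_estimate(200, 200), map_estimate(200, 200))
```

On the small sample the ML estimate jumps to an extreme value, while the Bayesian estimate is pulled toward the prior; as the sample grows the two estimates get close.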
Illustration of Bayesian Estimation
[Figure: three curves plotted over $\theta$. Prior: $p(\theta)$. Likelihood: $p(X \mid \theta)$, with $X = (x_1, \dots, x_N)$. Posterior: $p(\theta \mid X) \propto p(X \mid \theta)\,p(\theta)$. Marked points on the $\theta$ axis: the prior mode, the posterior mode, and $\theta_{ml}$ (the ML estimate).]
Maximum Likelihood Estimate
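Data: a document d with counts $c(w_1), \dots, c(w_N)$, and length |d|
Model: multinomial distribution M with parameters $\{p(w_i)\}$
Likelihood: p(d|M)
Maximum likelihood estimator: $M = \arg\max_M p(d \mid M)$
$$p(d \mid M) = \binom{|d|}{c(w_1) \dots c(w_N)} \prod_{i=1}^{N} \theta_i^{c(w_i)} \propto \prod_{i=1}^{N} \theta_i^{c(w_i)}, \quad \text{where } \theta_i = p(w_i),\ \sum_{i=1}^{N} \theta_i = 1$$
We’ll tune $p(w_i)$ to maximize l(d|M). Dropping the constant multinomial coefficient:
$$l(d \mid M) = \log p(d \mid M) = \sum_{i=1}^{N} c(w_i) \log \theta_i + \text{const}$$
Use Lagrange multiplier approach:
$$l'(d \mid M) = \sum_{i=1}^{N} c(w_i) \log \theta_i + \lambda \left( \sum_{i=1}^{N} \theta_i - 1 \right)$$
Set partial derivatives to zero:
$$\frac{\partial l'}{\partial \theta_i} = \frac{c(w_i)}{\theta_i} + \lambda = 0 \quad \Rightarrow \quad \theta_i = -\frac{c(w_i)}{\lambda}$$
Since $\sum_{i=1}^{N} \theta_i = 1$, $\lambda = -\sum_{i=1}^{N} c(w_i) = -|d|$
So, the ML estimate is
$$\theta_i = p(w_i) = \frac{c(w_i)}{|d|}$$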
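The closed-form result, word count divided by document length, is easy to compute. The sketch below assumes the document is already tokenized; the function name `unigram_mle` and the toy document are made up for illustration.

```python
from collections import Counter

def unigram_mle(tokens):
    """ML estimate of a multinomial (unigram) model: p(w) = c(w) / |d|."""
    counts = Counter(tokens)
    length = len(tokens)
    return {word: count / length for word, count in counts.items()}

# Toy document: "baseball" three times and "game" once.
d = "baseball baseball baseball game".split()
theta = unigram_mle(d)
print(theta)
```

Any word that does not occur in d gets probability zero under this estimate, which is one face of the small-sample problem.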
What You Should Know
• Probability concepts:
– sample space, event, random variable, conditional prob., multinomial distribution, etc.
• Bayes formula and its interpretation
• Statistics: Know how to compute maximum likelihood
estimate
Essential Background 2:
Basic Concepts in Information Theory
Information Theory
• Developed by Shannon in the 40s
• Maximizing the amount of information that can be
transmitted over an imperfect communication
channel
• Data compression (entropy)
• Transmission rate (channel capacity)
Basic Concepts in Information Theory
• Entropy: Measuring uncertainty of a random variable
• Kullback-Leibler divergence: comparing two
distributions
• Mutual Information: measuring the correlation of two
random variables
Entropy: Motivation
• Feature selection:
– If we use only a few words to classify docs, what kind of words should we use?
– P(Topic | “computer”=1) vs P(Topic | “the”=1): which is more random?
• Text compression:
– Some documents (less random) can be compressed more than others (more random)
– Can we quantify the “compressibility”?
• In general, given a random variable X following distribution p(X),
– How do we measure the “randomness” of X?
– How do we design optimal coding for X?
Entropy: Definition
Entropy H(X) measures the uncertainty/randomness of random variable X
$$H(X) = H(p) = -\sum_{x \,\in\, \text{all possible values}} p(x) \log p(x)$$
Define $0 \log 0 = 0$, $\log = \log_2$
Example:
– fair coin, p(Head) = 0.5: H(X) = 1
– biased coin, p(Head) = 0.8: H(X) between 0 and 1
– completely biased coin, p(Head) = 1: H(X) = 0
[Figure: H(X) plotted against P(Head), for P(Head) from 0 to 1.0]
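The three coins can be checked numerically. This is a minimal sketch using base-2 logarithms and the convention 0 log 0 = 0; the function name `entropy` is chosen for this illustration.

```python
import math

def entropy(dist):
    """H(X) in bits for a distribution given as a list of probabilities.

    Zero-probability values are skipped, which implements 0 log 0 = 0.
    """
    return sum(p * math.log2(1 / p) for p in dist if p > 0)

fair = entropy([0.5, 0.5])
biased = entropy([0.8, 0.2])
completely_biased = entropy([1.0, 0.0])
print(fair, biased, completely_biased)
```

The fair coin gives 1 bit, the completely biased coin gives 0 bits, and the biased coin falls in between.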
Entropy: Properties
• Minimum value of H(X): 0
– What kind of X has the minimum entropy?
• Maximum value of H(X): log M, where M is the number of possible values for X
– What kind of X has the maximum entropy?
• Related to coding
$$H(X) = -\sum_{x} p(x) \log_2 p(x) = \sum_{x} p(x) \log_2 \frac{1}{p(x)} = E\left[ \log_2 \frac{1}{p(x)} \right]$$
“Information of x” = “# bits to code x” = $-\log p(x)$, so $H(X) = E_p[-\log p(x)]$
Interpretations of H(X)
• Measures the “amount of information” in X
– Think of each value of X as a “message”
– Think of X as a random experiment (20 questions)
• Minimum average number of bits to compress values of X
– The more random X is, the harder to compress
A fair coin has the maximum information, and is hardest to compress
A biased coin has some information, and can be compressed to <1 bit on average
A completely biased coin has no information, and needs only 0 bits
“Information of x” = “# bits to code x” = $-\log p(x)$, so $H(X) = E_p[-\log p(x)]$
Conditional Entropy
• The conditional entropy of a random variable Y given another X expresses how much extra information one still needs to supply on average to communicate Y given that the other party knows X
$$H(Y \mid X) = \sum_{x \in X} p(x)\, H(Y \mid X = x) = -\sum_{x \in X} p(x) \sum_{y \in Y} p(y \mid x) \log p(y \mid x) = -\sum_{x \in X} \sum_{y \in Y} p(x, y) \log p(y \mid x) = -E[\log p(Y \mid X)]$$
• H(Topic | “computer”) vs. H(Topic | “the”)?
Cross Entropy H(p,q)
What if we encode X with a code optimized for a wrong distribution q?
Expected # of bits = ?
$$H(p, q) = E_p[-\log q(x)] = -\sum_{x} p(x) \log q(x)$$
Intuitively, $H(p,q) \ge H(p)$, and mathematically,
$$H(p, q) - H(p) = \sum_{x} p(x) \left[ -\log \frac{q(x)}{p(x)} \right] \ge -\log \sum_{x} \left[ p(x) \frac{q(x)}{p(x)} \right] = 0$$
By Jensen's inequality:
$$\sum_{i} p_i f(x_i) \ge f\left( \sum_{i} p_i x_i \right)$$
where f is a convex function, and $\sum_{i} p_i = 1$
Kullback-Leibler Divergence D(p||q)
What if we encode X with a code optimized for a wrong distribution q?
How many bits would we waste?
$$D(p \| q) = H(p, q) - H(p) = \sum_{x} p(x) \log \frac{p(x)}{q(x)}$$
(also called relative entropy)
Properties:
- $D(p\|q) \ge 0$
- $D(p\|q) \ne D(q\|p)$ in general
- $D(p\|q) = 0$ iff p = q
KL-divergence is often used to measure the distance between two distributions
Interpretation:
- Fix p, D(p||q) and H(p,q) vary in the same way
- If p is an empirical distribution, minimizing D(p||q) or H(p,q) is equivalent to maximizing likelihood
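A small numeric check of these properties, as a sketch: distributions are plain lists over the same values, logarithms are base 2, and q(x) is assumed to be positive wherever p(x) is. The function names are chosen for this illustration.

```python
import math

def cross_entropy(p, q):
    """H(p, q) = -sum_x p(x) log2 q(x); assumes q(x) > 0 wherever p(x) > 0."""
    return -sum(px * math.log2(qx) for px, qx in zip(p, q) if px > 0)

def kl_divergence(p, q):
    """D(p || q) = sum_x p(x) log2 (p(x) / q(x)); same assumption on q."""
    return sum(px * math.log2(px / qx) for px, qx in zip(p, q) if px > 0)

p = [0.5, 0.5]  # true distribution (fair coin)
q = [0.8, 0.2]  # wrong distribution used to design the code

wasted_bits = kl_divergence(p, q)
print(wasted_bits, kl_divergence(q, p))
```

Here H(p, p) is just H(p), so `cross_entropy(p, q) - cross_entropy(p, p)` gives the same number as `kl_divergence(p, q)`, and swapping the arguments gives a different value.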
Cross Entropy, KL-Div, and Likelihood
Data / sample for X: $Y = (y_1, \dots, y_N)$
Empirical distribution:
$$\tilde{p}(x) = \frac{1}{N} \sum_{i=1}^{N} \delta(y_i, x) = \frac{c(x)}{N}, \quad \delta(y, x) = \begin{cases} 1 & \text{if } x = y \\ 0 & \text{o.w.} \end{cases}$$
Likelihood:
$$L(Y) = \prod_{i=1}^{N} p(X = y_i)$$
Log-likelihood:
$$\log L(Y) = \sum_{i=1}^{N} \log p(X = y_i) = \sum_{x} c(x) \log p(X = x) = N \sum_{x} \tilde{p}(x) \log p(x)$$
$$\frac{1}{N} \log L(Y) = -H(\tilde{p}, p) = -D(\tilde{p} \| p) - H(\tilde{p})$$
Fix the data:
$$\arg\max_{p} \log L(Y) = \arg\min_{p} H(\tilde{p}, p) = \arg\min_{p} D(\tilde{p} \| p) = \arg\min_{p} 2^{-\frac{1}{N} \log L(Y)}$$
Criterion for selecting a good model: $\text{Perplexity}(p) = 2^{-\frac{1}{N} \log L(Y)}$
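The link between likelihood, cross entropy, and perplexity can be verified on a toy sample. This is a sketch with made-up coin models; the function names and the sample are assumptions for illustration, and every observed value is assumed to have non-zero probability under the model.

```python
import math
from collections import Counter

def perplexity(sample, model):
    """2 ** (-(1/N) * log2 L(Y)) for a sample Y under the model p."""
    log_likelihood = sum(math.log2(model[y]) for y in sample)
    return 2 ** (-log_likelihood / len(sample))

def cross_entropy_with_empirical(sample, model):
    """H(p~, p), where p~ is the empirical distribution c(x)/N of the sample."""
    n = len(sample)
    counts = Counter(sample)
    return -sum((c / n) * math.log2(model[x]) for x, c in counts.items())

sample = list("HHHT")                 # three heads, one tail
fair = {"H": 0.5, "T": 0.5}
biased = {"H": 0.75, "T": 0.25}       # equals the empirical distribution

print(perplexity(sample, fair), perplexity(sample, biased))
```

The model that matches the empirical distribution has the lower perplexity, and 2 raised to the cross entropy H(p~, p) gives the same number as the perplexity.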
Mutual Information I(X;Y)
Comparing two distributions: p(x,y) vs p(x)p(y)
$$I(X; Y) = \sum_{x, y} p(x, y) \log \frac{p(x, y)}{p(x)\,p(y)} = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X)$$
Properties: $I(X;Y) \ge 0$; $I(X;Y) = I(Y;X)$; $I(X;Y) = 0$ iff X & Y are independent
Interpretations:
- Measures how much the uncertainty of X is reduced given info. about Y
- Measures correlation between X and Y
- Related to the “channel capacity” in information theory
Examples:
I(Topic; “computer”) vs. I(Topic; “the”)?
I(“computer”; “program”) vs. I(“computer”; “baseball”)?
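Mutual information can be computed directly from a joint distribution. The sketch below uses two made-up joint tables over binary variables; the function name `mutual_information` is chosen for this illustration.

```python
import math

def mutual_information(joint):
    """I(X;Y) in bits from a dict {(x, y): p(x, y)}."""
    px, py = {}, {}
    for (x, y), p in joint.items():   # marginals p(x) and p(y)
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    return sum(p * math.log2(p / (px[x] * py[y]))
               for (x, y), p in joint.items() if p > 0)

# X and Y independent: the joint equals the product of the marginals.
independent = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}
# Y is a copy of X: knowing Y removes all uncertainty about X.
copy = {(0, 0): 0.5, (1, 1): 0.5}

print(mutual_information(independent), mutual_information(copy))
```

The independent case gives 0 bits, and the copy case gives 1 bit, which is the full entropy H(X) of a fair binary variable.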
What You Should Know
• Information theory concepts: entropy, cross entropy,
relative entropy, conditional entropy, KL-div., mutual
information
– Know their definitions, how to compute them
– Know how to interpret them
– Know their relationships
Essential Background 3:
Natural Language Processing
What is NLP?
… ‫صا ِدقَا ً َم َع نَ ْف ِس ِه َو َم َع أَ ْه ِل ِه َو ِجي َْرا ِن ِه َوأَ ْن َي ْبذُ َل‬
ُ ‫َي ِج‬
َ ‫ب‬
َ ‫علَى اإل ْن‬
َ ‫ان أن َي ُكونَ ِأم ْينَا ً َو‬
ِ ‫س‬
َ ‫الو‬
‫علَى َما‬
ِ ‫… ُك َّل ُج ْه ٍد ِفي ِإع‬
َ ‫ط ِن َوأَ ْن َي ْع َم َل‬
َ ‫ْالء شَأ ْ ِن‬
Arabic text
Spanish text: La listas actualizadas figuran como Aneio I.
How can a computer make sense out of such a string?
- Morphology: What are the basic units of meaning (words)? What is the meaning of each word?
- Syntax: How are words related with each other?
- Semantics: What is the “combined meaning” of words?
- Pragmatics: What is the “meta-meaning”? (speech act)
- Discourse: Handling a large chunk of text
- Inference: Making sense of everything
An Example of NLP
A dog is chasing a boy on the playground
• Lexical analysis (part-of-speech tagging): A/Det dog/Noun is/Aux chasing/Verb a/Det boy/Noun on/Prep the/Det playground/Noun
• Syntactic analysis (parsing): [Sentence [Noun Phrase A dog] [Verb Phrase [Verb Phrase [Complex Verb is chasing] [Noun Phrase a boy]] [Prep Phrase on [Noun Phrase the playground]]]]
• Semantic analysis: Dog(d1). Boy(b1). Playground(p1). Chasing(d1,b1,p1).
• Inference: Scared(x) if Chasing(_,x,_). Combined with the facts above, this gives Scared(b1)
• Pragmatic analysis (speech act): A person saying this may be reminding another person to get the dog back…
If we can do this for all the
sentences, then …
BAD NEWS:
Unfortunately, we can’t.
General NLP = “AI-Complete”
NLP is Difficult!
• Natural language is designed to make human
communication efficient. As a result,
– we omit a lot of “common sense” knowledge, which
we assume the hearer/reader possesses
– we keep a lot of ambiguities, which we assume the
hearer/reader knows how to resolve
• This makes EVERY step in NLP hard
– Ambiguity is a “killer”
– Common sense reasoning is a prerequisite
Examples of Challenges
• Word-level ambiguity: E.g.,
– “design” can be a noun or a verb (Ambiguous POS)
– “root” has multiple meanings (Ambiguous sense)
• Syntactic ambiguity: E.g.,
– “natural language processing” (Modification)
– “A man saw a boy with a telescope.” (PP Attachment)
• Anaphora resolution: “John persuaded Bill to buy a
TV for himself.” (himself = John or Bill?)
• Presupposition: “He has quit smoking.” implies that
he smoked before.
Despite all the challenges,
research in NLP has also made
a lot of progress…
High-level History of NLP
• Early enthusiasm (1950’s): Machine Translation
– Too ambitious
– Bar-Hillel report (1960) concluded that fully-automatic high-quality translation could not be accomplished without knowledge (Dictionary + Encyclopedia)
• Less ambitious applications (late 1960’s & early 1970’s): Limited success, failed to scale up
– Speech recognition
– Dialogue (Eliza): shallow understanding
– Inference and domain knowledge (SHRDLU=“block world”): deep understanding in limited domain
• Real world evaluation (late 1970’s – now)
– Story understanding (late 1970’s & early 1980’s): knowledge representation
– Large scale evaluation of speech recognition, text retrieval, information extraction (1980 – now): robust component techniques
– Statistical approaches enjoy more success (first in speech recognition & retrieval, later others): stat. language models
• Current trend:
– Boundary between statistical and symbolic approaches is disappearing.
– We need to use all the available knowledge
– Application-driven NLP research (bioinformatics, Web, Question answering…): applications
The State of the Art
A dog is chasing a boy on the playground
[Figure: the sentence with its POS tags (Det, Noun, Aux, Verb, Prep) and parse tree (Noun Phrase, Complex Verb, Prep Phrase, Verb Phrase, Sentence), annotated with the state of the art at each level]
• POS tagging: 97%
• Parsing: partial >90%(?)
• Semantics: some aspects
– Entity/relation extraction
– Word sense disambiguation
– Anaphora resolution
• Speech act analysis: ???
• Inference: ???
Technique Showcase: POS Tagging
Training data (annotated text):
This/Det sentence/N serves/V1 as/P an/Det example/N of/P annotated/V2 text/N …
POS Tagger: “This is a new sentence” → This/Det is/Aux a/Det new/Adj sentence/N
Consider all possibilities, and pick the one with the highest probability:
This is a new sentence
Det Det Det Det Det
……
Det Aux Det Adj N
……
V2 V2 V2 V2 V2
Method 1: Independent assignment (most common tag)
$$p(w_1, \dots, w_k, t_1, \dots, t_k) = p(t_1 \mid w_1) \cdots p(t_k \mid w_k)\, p(w_1) \cdots p(w_k)$$
Method 2: Partial dependency
$$p(w_1, \dots, w_k, t_1, \dots, t_k) = \prod_{i=1}^{k} p(w_i \mid t_i)\, p(t_i \mid t_{i-1})$$
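Method 1 amounts to tagging each word with its most common tag in the training data. The sketch below is an illustration only: the first nine training pairs follow the annotated sentence on the slide, the last three pairs are hypothetical additions so that every test word has been seen, and the default tag for unseen words is also an assumption.

```python
from collections import Counter, defaultdict

def train_most_common_tag(tagged_words):
    """For each word, keep the tag it carries most often in the training data."""
    counts = defaultdict(Counter)
    for word, tag in tagged_words:
        counts[word.lower()][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def tag_sentence(words, model, default="N"):
    """Tag each word independently; unseen words get an assumed default tag."""
    return [model.get(word.lower(), default) for word in words]

training = [
    ("This", "Det"), ("sentence", "N"), ("serves", "V1"), ("as", "P"),
    ("an", "Det"), ("example", "N"), ("of", "P"), ("annotated", "V2"),
    ("text", "N"),
    # Hypothetical extra pairs, not from the slide:
    ("is", "Aux"), ("a", "Det"), ("new", "Adj"),
]

model = train_most_common_tag(training)
tags = tag_sentence("This is a new sentence".split(), model)
print(tags)
```

Because each tag is chosen without looking at its neighbors, this baseline cannot use context; Method 2 adds the tag-to-tag dependency p(t_i | t_{i-1}) for that purpose.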
Technique Showcase: Parsing
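Grammar (rule probability, where shown):
S → NP VP  1.0
NP → Det BNP  0.3
NP → BNP  0.4
NP → NP PP  0.3
BNP → N
…
VP → V
VP → Aux V NP
…
VP → VP PP
PP → P NP  1.0
Lexicon:
V → chasing  0.01
Aux → is
N → dog  0.003
N → boy
N → playground …
Det → the
…
Det → a
P → on
Generate: the grammar and lexicon generate candidate parse trees for “A dog is chasing a boy on the playground”
Tree 1 (PP attached to the verb phrase), probability of this tree = 0.000015:
[S [NP [Det A] [BNP [N dog]]] [VP [VP [Aux is] [V chasing] [NP a boy]] [PP [P on] [NP the playground]]]]
Tree 2 (PP attached to the noun phrase):
[S [NP [Det A] [BNP [N dog]]] [VP [Aux is] [V chasing] [NP [NP a boy] [PP [P on] [NP the playground]]]]]
[The figure also shows “roller skates” next to “the playground”.]
Choose a tree with highest prob….
Can also be treated as a classification/decision problem…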
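Under a probabilistic grammar, the probability of a tree is the product of the probabilities of the rules it uses. The sketch below illustrates this; rule values marked "slide" are the ones visible in the grammar and lexicon, all other values are hypothetical fill-ins, so the resulting numbers will not match the probability quoted for the tree on the slide.

```python
# Rule probabilities: (left-hand side, right-hand side) -> probability.
RULES = {
    ("S", ("NP", "VP")): 1.0,          # slide
    ("NP", ("Det", "BNP")): 0.3,       # slide
    ("NP", ("BNP",)): 0.4,             # slide
    ("NP", ("NP", "PP")): 0.3,         # slide
    ("PP", ("P", "NP")): 1.0,          # slide
    ("V", ("chasing",)): 0.01,         # slide
    ("N", ("dog",)): 0.003,            # slide
    ("BNP", ("N",)): 1.0,              # hypothetical from here on
    ("VP", ("V",)): 0.1,
    ("VP", ("Aux", "V", "NP")): 0.5,
    ("VP", ("VP", "PP")): 0.4,
    ("Aux", ("is",)): 1.0,
    ("N", ("boy",)): 0.002,
    ("N", ("playground",)): 0.001,
    ("Det", ("the",)): 0.5,
    ("Det", ("a",)): 0.5,
    ("P", ("on",)): 1.0,
}

def tree_prob(tree):
    """Product of rule probabilities; a tree is (label, child, ...), a leaf is a word."""
    label, *children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    p = RULES[(label, rhs)]
    for child in children:
        if not isinstance(child, str):
            p *= tree_prob(child)
    return p

def np(det, noun):
    return ("NP", ("Det", det), ("BNP", ("N", noun)))

def chase(obj):
    return ("VP", ("Aux", "is"), ("V", "chasing"), obj)

pp = ("PP", ("P", "on"), np("the", "playground"))
vp_attach = ("S", np("a", "dog"), ("VP", chase(np("a", "boy")), pp))
np_attach = ("S", np("a", "dog"), chase(("NP", np("a", "boy"), pp)))

best = max([vp_attach, np_attach], key=tree_prob)
print(tree_prob(vp_attach), tree_prob(np_attach))
```

The two trees share every rule except the one that attaches the PP, so with these assumed values the choice comes down to VP → VP PP versus NP → NP PP.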
Semantic Analysis Techniques
• Only successful for VERY limited domain or for SOME
aspect of semantics
• E.g.,
– Entity extraction (e.g., recognizing a person’s name):
Use rules and/or machine learning
– Word sense disambiguation: addressed as a
classification problem with supervised learning
– Anaphora resolution …
In general, exploiting machine learning
and statistical language models…
What We Can’t Do
• 100% POS tagging
– “He turned off the highway.” vs “He turned off the fan.”
• General complete parsing
– “A man saw a boy with a telescope.”
• Deep semantic analysis
– Will we ever be able to precisely define the meaning of “own” in “John
owns a restaurant.”?
Robust & general NLP tends to be “shallow”,
while “deep” understanding doesn’t scale up …
Major NLP Applications
• Speech recognition: e.g., Auto telephone call routing
• Text management (our focus)
– Text retrieval/filtering
– Text classification
– Text summarization
– Text mining
– Query answering
• Language tutoring
– Spelling/grammar correction
• Machine translation
– Cross-language retrieval
– Restricted natural language
• Natural language user interface
NLP & Text Management
Better NLP => Better Text Management
Bad NLP => Bad Text Management?
Robust, shallow NLP tends to be more useful than deep, but
fragile NLP.
Errors in NLP can hurt text management performance…
How Much NLP is Really Needed?
[Figure: tasks arranged by their dependency on NLP, listed as Classification/Retrieval; Summarization/Extraction/Mining; Translation/Dialogue; Question Answering. One end of the scale is labeled “Easier” & more “workarounds”.]
Workaround Example I: ELIZA
ELIZA plays the role of a therapist
How does it work?
Person : Men are all alike.
ELIZA : In what way?
Person : They're always bugging us about something or other.
ELIZA : Can you think of a specific example?
Person : Well, my boyfriend made me come here.
ELIZA : Your boyfriend made you come here?
Person : He says I'm depressed much of the time.
ELIZA : I am sorry to hear that you are depressed.
Person : It's true. I am unhappy.
ELIZA : Do you think coming here will help you not to be
unhappy?
Person : I need some help, that much seems certain.
ELIZA : What would it mean to you if you got some help?
Person : Perhaps I could learn to get along with my mother.
ELIZA : Tell me more about your family
Heuristic rules based on keyword matching! E.g.,
– I remember X => Do you often think of X?
– always => Can you think of a specific example?
It keeps the dialogue going on, but NEVER makes any claim!
A breakdown example:
Person : Failure is the mother of success.
ELIZA : Tell me more about your family
No NLP, but useful. Perhaps we should call this NLP?
Statistical NLP often has a similar flavor with “SOFT” rules LEARNED from data
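The keyword-matching idea can be sketched in a few lines. This is not ELIZA's actual rule set: the first two rules come from the examples on this slide, the family rule is inferred from the dialogue, and the fallback reply is a hypothetical addition.

```python
import re

# (pattern, reply template) pairs, tried in order.
RULES = [
    (re.compile(r"\bI remember (.+)", re.IGNORECASE), "Do you often think of {0}?"),
    (re.compile(r"\balways\b", re.IGNORECASE), "Can you think of a specific example?"),
    # Inferred from the dialogue: a family word triggers a canned reply.
    (re.compile(r"\b(mother|father|family)\b", re.IGNORECASE), "Tell me more about your family"),
]
DEFAULT_REPLY = "Please go on."  # hypothetical fallback

def reply(utterance):
    """Return the reply of the first rule whose keyword pattern matches."""
    for pattern, template in RULES:
        match = pattern.search(utterance)
        if match:
            groups = [g.rstrip(".!?") for g in match.groups()]
            return template.format(*groups)
    return DEFAULT_REPLY

print(reply("They're always bugging us about something or other."))
print(reply("Failure is the mother of success."))
```

The second call reproduces the breakdown example: the rule fires on the keyword "mother" without any understanding of the sentence.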
Workaround Example II:
Statistical Translation
Learn how to translate Chinese to English from many example translations
Intuitions:
- If we have seen all possible translations, then we simply lookup
- If we have seen a similar translation, then we can adapt
- If we haven’t seen any example that’s similar, we try to generalize what we’ve seen
All these intuitions are captured through a probabilistic model
[Figure: noisy channel model. An English speaker produces English words (E) with probability P(E); a noisy channel, P(C|E), turns them into Chinese words (C); the translator recovers the English translation, which requires P(E|C) = ?]
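By Bayes' rule, P(E|C) is proportional to P(C|E)P(E), so a translator can score each candidate English sentence by that product and pick the best one. The sketch below assumes a fixed candidate list and made-up probabilities; the function name `best_translation` and all values are hypothetical.

```python
def best_translation(chinese, candidates, p_e, p_c_given_e):
    """Pick the E that maximizes P(C | E) * P(E), which is proportional to P(E | C)."""
    return max(candidates, key=lambda e: p_c_given_e[(chinese, e)] * p_e[e])

chinese = "失败是成功之母"
candidates = [
    "Failure is the mother of success.",
    "Mother success of the is failure.",
]
# Hypothetical language model P(E): fluent English is far more probable.
p_e = {candidates[0]: 1e-3, candidates[1]: 1e-6}
# Hypothetical channel model P(C | E): both candidates cover the same words.
p_c_given_e = {(chinese, candidates[0]): 0.1, (chinese, candidates[1]): 0.1}

print(best_translation(chinese, candidates, p_e, p_c_given_e))
```

When the channel model cannot tell the candidates apart, the language model P(E) decides, which is why the fluent sentence wins here.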
So, what NLP techniques are most
useful for text management?
Statistical NLP in general, and
statistical language models in particular
The need for high robustness and efficiency
implies the dominant use of
simple models (i.e., unigram models)
What You Should Know
• NLP is the basis for text management
– Better NLP enables better text management
– Better NLP is necessary for sophisticated tasks
• But
– Bad NLP doesn’t mean bad text management
– There are often “workarounds” for a task
– Inaccurate NLP can even hurt the performance of a task
• The most effective NLP techniques are often statistical with the help of linguistic knowledge
• The challenge is to bridge the gap between NLP and applications
Roadmap
• Today’s lecture
– Course overview
– Essential background (prob & stat, info theory, NLP)
• Next two lectures: overview of IR
– Basic concepts
– Evaluation
– Brief history
– Basic models
–…