Automatically Building Special Purpose Search Engines with Natural Language Processing & Information Extraction
CSE 573
Dan Weld
todo
• Be more explicit on how rule learning works (before Rapier, BWI) – show search through rule space
• Too much of a lecture – add more interaction / questions
Confession
• Proposal 1
– Keep homework and final exam
– Teach next week so we can cover all the cool stuff
• Proposal 2
– Talk quickly
– Mini-project replaces homework & final, due 12/11
Mini-Project Options
1. Write a program to solve the counterfeit coin
problem on the midterm.
2. Build a DPLL and/or a WalkSAT satisfiability solver.
3. Build a spam filter using naïve Bayes,
decision trees, or compare learners in the
Weka ML package.
4. Write a program which learns Bayes nets.
Communication is the problem of the century…
Why Study NLP?
• A hallmark of human intelligence.
• Text is the largest repository of human
knowledge and is growing quickly.
– emails, news articles, web pages, IM, scientific
articles, insurance claims, customer complaint letters,
transcripts of phone calls, technical documents,
government documents, patent portfolios, court
decisions, contracts, ……
What is NLP?
• Natural Language Processing
– Understand human language (speech, text)
– Successfully communicate (NL generation)
• Pragmatics: “Can you pass the salt?”
• Semantics: “Paris is the capital of France”
– How do we represent meaning?
• Syntax: analyze the structure of a sentence
• Morphology: analyze the structure of words
– Dogs = dog + s.
Fundamental NLP Tasks
• Speech recognition: speech signal to words
• Parsing: decompose sentence into units
– Part of speech tagging: e.g., noun vs. verb
• Semantic role labeling: for a verb, find the units that
fill pre-defined “semantic roles” (e.g., Agent, Patient, or
Instrument)
– Example: “John hit the ball with the bat”
Fundamental NLP Tasks
• Semantics: map sentence to corresponding
“meaning representation” (e.g., logic)
– give(John, Book, Sally)
– Quantification: Everybody loves somebody
• Word Sense Disambiguation
– orange juice vs. orange coat
– Where do word senses come from?
• Co-reference resolution:
– The dog found a cookie; He ate it.
• Implicit “text” - what is left unsaid?
– Joe entered the restaurant, ordered, ate and left. The
owner said “Come back soon!”
– Joe entered the restaurant, ordered, ate and left. The
owner said “Hey, I’ll call the police!”
NLP Applications
• Question answering
– Who is the first Taiwanese president?
• Text Categorization/Routing
– e.g., customer e-mails.
• Text Mining
– Find everything that interacts with BRCA1.
• Machine Translation
• Information Extraction
• Spelling correction
– Is that just dictionary lookup?
Main Challenge in NLP: Ambiguity
• Lexical:
– Label (noun or verb)?
– London (Jack or capital of the UK)?
• Syntactic
(examples from newspaper headlines):
– Prepositional Phrase Attachment
• Ban on Nude Dancing on Governor’s Desk
• Killer Sentenced to Die for Second Time in 10 Years
– Word Sense Disambiguation
• 1990: Iraqi Head Seeking Arms
– Syntactic Ambiguity (what’s modifying what)
• Juvenile Court to Try Shooting Defender.
• Semantic ambiguity:
– “snake poison”
• Rampant metaphors:
– “prices went up”
Speech Act Theory (Searle ’75)
• Example Speech Acts:
– assertives = speech acts that commit a speaker to the truth
of the expressed proposition
– directives = speech acts that are to cause the hearer to take
a particular action, e.g. requests, commands and advice
– commissives = speech acts that commit a speaker to some
future action, e.g. promises and oaths
• “Could you pass the salt?”
– “Yes.”
• Research has focused on specifying semantics and
analyzing primitives instead of implemented “speech
actors”
Semantics
• How do you map a sentence to a semantic representation?
– What are the semantic primitives?
• Schank: 12 (or so) primitives
– (Conceptual dependency theory, 1972)
• The basis of natural language is conceptual.
• The conceptual level is interlingual, while the
sentential level is language-specific.
– De-emphasize syntax
– Focus on semantic expectations as the basis for NLU
• Status: didn’t yield practical systems, out of favor, but..
Syntax is about Sentence Structures
• Sentences have structures and are
made up of constituents.
– The constituents are phrases.
– A phrase consists of a head and modifiers.
• The category of the head determines
the category of the phrase
– e.g., a phrase headed by a noun is a noun
phrase
Alternative Parses
Teacher strikes idle kids
1. [S [NP [N Teacher] [N strikes]] [VP [V idle] [NP [N kids]]]]
2. [S [NP [N Teacher]] [VP [V strikes] [NP [A idle] [N kids]]]]
How do you choose the parse?
• Old days: syntactic rules, semantics.
• Since early 1980’s: statistical parsing.
– parsing as supervised learning
• Input: Tree bank of (sentence, tree)
• Output: PCFG Parser + lexical language model
• Parsing decisions depend on the words!
• Example: VP → V + NP + NP?
• Does the verb take two objects or one?
• ‘told Sue the story’ versus ‘invented the Internet’
Issues with Statistical Parsing
• Statistical parsers still make “plenty” of errors
– Have the report on my desk by Monday
• Tree banks are language specific
• Tree banks are genre specific
– Train on WSJ → fail on the Web
– standard distributional assumption
• Unsupervised, un-lexicalized parsers exist
– But performance is substantially weaker
How Difficult is Morphology?
• Examples from Woods et al. 2000
– delegate (de + leg + ate): take the legs from
– caress (car + ess): female car
– cashier (cashy + er): more wealthy
– lacerate (lace + rate): speed of tatting
– ratify (rat + ify): infest with rodents
– infantry (infant + ry): childish behavior
What is “Information Extraction”
As a task:
Filling slots in a database from sub-segments of text.
October 14, 2002, 4:00 a.m. PT
For years, Microsoft Corporation CEO Bill
Gates railed against the economic philosophy
of open-source software with Orwellian fervor,
denouncing its communal licensing as a
"cancer" that stifled technological innovation.
Today, Microsoft claims to "love" the open-source concept, by which software code is
made public to encourage improvement and
development by outside programmers. Gates
himself says Microsoft will gladly disclose its
crown jewels--the coveted code behind the
Windows operating system--to select
customers.
NAME | TITLE | ORGANIZATION
"We can be open source. We love the concept
of shared source," said Bill Veghte, a
Microsoft VP. "That's a super-important shift
for us in terms of code access."
Richard Stallman, founder of the Free
Software Foundation, countered saying…
Slides from Cohen & McCallum
What is “Information Extraction”
As a task:
Filling slots in a database from sub-segments of text.
Running IE over the same news story fills in the slots:
NAME             | TITLE   | ORGANIZATION
Bill Gates       | CEO     | Microsoft
Bill Veghte      | VP      | Microsoft
Richard Stallman | founder | Free Software Foundation
Slides from Cohen & McCallum
What is “Information Extraction”
As a family of techniques:
Information Extraction = segmentation + classification + association + clustering
Applied to the same news story, the pipeline segments candidate strings, classifies them, associates them into records, and clusters coreferent mentions (marked * below):
Microsoft Corporation *  | CEO     | Bill Gates
Microsoft *              |         | Gates
Microsoft *              | VP      | Bill Veghte
Free Software Foundation | founder | Richard Stallman
Slides from Cohen & McCallum
IE in Context
Document collection → Spider → Filter by relevance → IE (Segment → Classify → Associate → Cluster) → Load DB → Database → Query, Search, Data mine
Supporting steps: create ontology, label training data, train extraction models
Slides from Cohen & McCallum
IE History
Pre-Web
• Mostly news articles
– De Jong’s FRUMP [1982]
• Hand-built system to fill Schank-style “scripts” from news wire
– Message Understanding Conference (MUC) DARPA [’87-’95], TIPSTER [’92-’96]
• Most early work dominated by hand-built models
– E.g. SRI’s FASTUS, hand-built FSMs.
– But by 1990’s, some machine learning: Lehnert, Cardie, Grishman and then
HMMs: Elkan [Leek ’97], BBN [Bikel et al ’98]
Web
• AAAI ’94 Spring Symposium on “Software Agents”
– Much discussion of ML applied to Web. Maes, Mitchell, Etzioni.
• Tom Mitchell’s WebKB, ‘96
– Build KB’s from the Web.
• Wrapper Induction
– First by hand, then ML: [Doorenbos ‘96], [Soderland ’96], [Kushmerick ’97],…
Slides from Cohen & McCallum
What makes IE from the Web Different?
Less grammar, but more formatting & linking
Newswire example:
Apple to Open Its First Retail Store in New York City
MACWORLD EXPO, NEW YORK--July 17, 2002--Apple's first retail store in New York City will open in Manhattan's SoHo district on Thursday, July 18 at 8:00 a.m. EDT. The SoHo store will be Apple's largest retail store to date and is a stunning example of Apple's commitment to offering customers the world's best computer shopping experience.
"Fourteen months after opening our first retail store, our 31 stores are attracting over 100,000 visitors each week," said Steve Jobs, Apple's CEO. "We hope our SoHo store will surprise and delight both Mac and PC users who want to see everything the Mac can do to enhance their digital lifestyles."
Web example:
www.apple.com/retail
www.apple.com/retail/soho
www.apple.com/retail/soho/theatre.html
The directory structure, link structure, formatting & layout of the Web is its own new grammar.
Slides from Cohen & McCallum
Landscape of IE Tasks (1/4):
Pattern Feature Domain
• Text paragraphs without formatting, e.g. a plain-prose bio:
Astro Teller is the CEO and co-founder of BodyMedia. Astro holds a Ph.D. in Artificial Intelligence from Carnegie Mellon University, where he was inducted as a national Hertz fellow. His M.S. in symbolic and heuristic computation and B.S. in computer science are from Stanford University. His work in science, literature and business has appeared in international media from the New York Times to CNN to NPR.
• Grammatical sentences and some formatting & links
• Non-grammatical snippets, rich formatting & links
• Tables
Slides from Cohen & McCallum
Landscape of IE Tasks (2/4):
Pattern Scope
• Web site specific (formatting): e.g., Amazon.com book pages
• Genre specific (layout): e.g., resumes
• Wide, non-specific (language): e.g., university names
Slides from Cohen & McCallum
Landscape of IE Tasks (3/4):
Pattern Complexity
E.g. word patterns:
• Closed set (U.S. states): “He was born in Alabama…”; “The big Wyoming sky…”
• Regular set (U.S. phone numbers): “Phone: (413) 545-1323”; “The CALD main office can be reached at 412-268-1299”
• Complex pattern (U.S. postal addresses): “University of Arkansas, P.O. Box 140, Hope, AR 71802”; “Headquarters: 1128 Main Street, 4th Floor, Cincinnati, Ohio 45210”
• Ambiguous patterns, needing context + many sources of evidence (person names): “…was among the six houses sold by Hope Feldman that year.”; “Pawel Opalinski, Software Engineer at WhizBang Labs.”
Slides from Cohen & McCallum
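The “regular set” case above is where a hand-written pattern suffices; no learning is needed. A minimal sketch in Python, using the two phone-number strings from this slide as test input:

    import re

    # U.S. phone numbers form a regular set, so a regular expression
    # captures them directly.
    PHONE = re.compile(r"\(?\b(\d{3})\)?[-. ]?(\d{3})[-. ]?(\d{4})\b")

    for text in ["Phone: (413) 545-1323",
                 "The CALD main office can be reached at 412-268-1299"]:
        match = PHONE.search(text)
        if match:
            print(match.group(0))  # "(413) 545-1323", then "412-268-1299"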
Landscape of IE Tasks (4/4):
Pattern Combinations
Jack Welch will retire as CEO of General Electric tomorrow. The top role
at the Connecticut company will be filled by Jeffrey Immelt.
• Single entity (“named entity” extraction):
Person: Jack Welch; Person: Jeffrey Immelt; Location: Connecticut
• Binary relationship:
Relation: Person-Title; Person: Jack Welch; Title: CEO
Relation: Company-Location; Company: General Electric; Location: Connecticut
• N-ary record:
Relation: Succession; Company: General Electric; Title: CEO; Out: Jack Welch; In: Jeffrey Immelt
Slides from Cohen & McCallum
Evaluation of Single Entity Extraction
TRUTH:
Michael Kearns and Sebastian Seung will start Monday’s tutorial, followed by Richard M. Karpe and Martin Cooke.
PRED:
Michael Kearns and Sebastian Seung will start Monday’s tutorial, followed by Richard M. Karpe and Martin Cooke.
Precision = (# correctly predicted segments) / (# predicted segments) = 2/6
Recall = (# correctly predicted segments) / (# true segments) = 2/4
F1 = harmonic mean of Precision & Recall = 2 / ((1/P) + (1/R))
Slides from Cohen & McCallum
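A small sketch of this segment-level scoring in Python, assuming extractions are represented as (start, end) token spans and only exact-boundary matches count as correct:

    def segment_prf(true_spans, pred_spans):
        """Segment-level precision, recall, and F1: a predicted segment
        counts as correct only if it exactly matches a true span."""
        correct = len(set(true_spans) & set(pred_spans))
        p = correct / len(pred_spans) if pred_spans else 0.0
        r = correct / len(true_spans) if true_spans else 0.0
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        return p, r, f1

    # Toy data mirroring the slide: 4 true segments, 6 predicted, 2 correct.
    truth = [(0, 2), (4, 6), (13, 16), (17, 19)]
    pred = [(0, 2), (4, 6), (7, 8), (9, 10), (13, 14), (18, 19)]
    print(segment_prf(truth, pred))  # (0.333..., 0.5, 0.4)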
State of the Art Performance
• Named entity recognition
– Person, Location, Organization, …
– F1 in high 80’s or low- to mid-90’s
• Binary relation extraction
– Contained-in (Location1, Location2)
Member-of (Person1, Organization1)
– F1 in 60’s or 70’s or 80’s
• Wrapper induction
– Extremely accurate performance obtainable
– Human effort (~30min) required on each site
Slides from Cohen & McCallum
Landscape of IE Techniques (1/1):
Models
Running example: “Abraham Lincoln was born in Kentucky.”
• Lexicons: test whether a candidate string is a member of a list (Alabama, Alaska, …, Wisconsin, Wyoming).
• Classify pre-segmented candidates: a classifier asks “which class?” for each candidate segment.
• Boundary models: classifiers mark BEGIN and END positions in the token stream.
• Sliding window: a classifier asks “which class?” for each window; try alternate window sizes.
• Finite state machines: find the most likely state sequence over the tokens.
• Context free grammars: find the most likely parse of the sentence (POS tags combine into NP, PP, VP, S).
…and beyond
Any of these models can be used to capture words, formatting, or both.
Slides from Cohen & McCallum
Landscape: Our Focus
• Pattern complexity: closed set → regular → complex → ambiguous
• Pattern feature domain: words → words + formatting → formatting
• Pattern scope: site-specific → genre-specific → general
• Pattern combinations: entity → binary → n-ary
• Models: lexicon → regex → window → boundary → FSM → CFG
Slides from Cohen & McCallum
Sliding Windows
Slides from Cohen & McCallum
Extraction by Sliding Window
GRAND CHALLENGES FOR MACHINE LEARNING
Jaime Carbonell
School of Computer Science
Carnegie Mellon University
3:30 pm
7500 Wean Hall
Machine learning has evolved from obscurity in the 1970s into a vibrant and popular discipline in artificial intelligence during the 1980s and 1990s. As a result of its success and growth, machine learning is evolving into a collection of related disciplines: inductive concept acquisition, analytic learning in problem solving (e.g. analogy, explanation-based learning), learning theory (e.g. PAC learning), genetic algorithms, connectionist learning, hybrid systems, and so on.
E.g., looking for the seminar location: a window slides over the text, and each candidate window is classified as a location or not.
CMU UseNet Seminar Announcement
Slides from Cohen & McCallum
A “Naïve Bayes” Sliding Window Model
[Freitag 1997]
… 00 : pm Place : Wean Hall Rm 5409 Speaker : Sebastian Thrun …
[ w_{t-m} … w_{t-1} ] [ w_t … w_{t+n} ] [ w_{t+n+1} … w_{t+n+m} ]
      prefix               contents             suffix
• Estimate Pr(LOCATION | window) using Bayes rule
• Try all “reasonable” windows (vary length, position)
• Assume independence for length, prefix words, suffix words, content words
• Estimate from data quantities like: Pr(“Place” in prefix | LOCATION)
• If Pr(“Wean Hall Rm 5409” = LOCATION) is above some threshold, extract it.
Other examples of sliding window: [Baluja et al 2000]
(decision tree over individual words & their context)
Slides from Cohen & McCallum
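A minimal sketch of the scoring step in Python. The per-field log-probabilities below are hypothetical stand-ins for quantities estimated from labeled announcements (smoothed in practice):

    # Hypothetical log-probabilities, assumed already estimated from
    # labeled seminar announcements.
    log_p_prefix = {"place": -0.5, ":": -0.7}   # Pr(word in prefix | LOCATION)
    log_p_content = {"wean": -1.0, "hall": -0.9, "rm": -1.2, "5409": -2.0}
    log_p_suffix = {"speaker": -0.8}
    log_p_len = {1: -2.0, 2: -1.5, 3: -1.0, 4: -0.9}  # Pr(length | LOCATION)
    UNK = -6.0  # fallback log-probability for unseen words

    def window_score(prefix, contents, suffix):
        """Naive Bayes log-score: independence across length, prefix
        words, content words, and suffix words."""
        score = log_p_len.get(len(contents), UNK)
        score += sum(log_p_prefix.get(w.lower(), UNK) for w in prefix)
        score += sum(log_p_content.get(w.lower(), UNK) for w in contents)
        score += sum(log_p_suffix.get(w.lower(), UNK) for w in suffix)
        return score

    s = window_score(["Place", ":"], ["Wean", "Hall", "Rm", "5409"], ["Speaker"])
    # Extract the window as a LOCATION if s clears a tuned threshold.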
“Naïve Bayes” Sliding Window Results
Domain: CMU UseNet Seminar Announcements
GRAND CHALLENGES FOR MACHINE LEARNING
Jaime Carbonell
School of Computer Science
Carnegie Mellon University
3:30 pm
7500 Wean Hall
Machine learning has evolved from obscurity
in the 1970s into a vibrant and popular
discipline in artificial intelligence during
the 1980s and 1990s.
As a result of its
success and growth, machine learning is
evolving into a collection of related
disciplines: inductive concept acquisition,
analytic learning in problem solving (e.g.
analogy, explanation-based learning),
learning theory (e.g. PAC learning), genetic
algorithms, connectionist learning, hybrid
systems, and so on.
Field         F1
Person Name:  30%
Location:     61%
Start Time:   98%
Slides from Cohen & McCallum
Realistic sliding-window-classifier IE
• What windows to consider?
– all windows containing as many tokens as the
shortest example, but no more tokens than the
longest example
• How to represent a classifier? It might:
– Restrict the length of window;
– Restrict the vocabulary or formatting used
before/after/inside window;
– Restrict the relative order of tokens, etc.
• Learning Method
– SRV: Top-Down Rule Learning
[Freitag AAAI ‘98]
– Rapier: Bottom-Up
[Califf & Mooney, AAAI ‘99]
Slides from Cohen & McCallum
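The window-enumeration heuristic in the first bullet is simple to state in code. A sketch, assuming token lists and the (hypothetical) window_score function from the naïve Bayes sketch above:

    def candidate_windows(tokens, min_len, max_len):
        """Yield every window whose length lies between the shortest
        and longest field seen in training."""
        for start in range(len(tokens)):
            for length in range(min_len, max_len + 1):
                if start + length <= len(tokens):
                    yield start, start + length, tokens[start:start + length]

    tokens = "Place : Wean Hall Rm 5409".split()
    for i, j, window in candidate_windows(tokens, 2, 4):
        pass  # score each window with a classifier, e.g. window_score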
Rapier: results – precision/recall
Slides from Cohen & McCallum
Rule-learning approaches to sliding-window classification: Summary
• SRV, Rapier, and WHISK [Soderland KDD ‘97]
– Representations for classifiers allow restriction of
the relationships between tokens, etc.
– Representations are carefully chosen subsets of
even more powerful representations based on
logic programming (ILP and Prolog)
– Use of these “heavyweight” representations is
complicated, but seems to pay off in results
• Can simpler representations for classifiers
work?
Slides from Cohen & McCallum
BWI: Learning to detect boundaries
[Freitag & Kushmerick, AAAI 2000]
• Another formulation: learn three probabilistic
classifiers:
– START(i) = Prob( position i starts a field)
– END(j) = Prob( position j ends a field)
– LEN(k) = Prob( an extracted field has length k)
• Then score a possible extraction (i,j) by
START(i) * END(j) * LEN(j-i)
• LEN(k) is estimated from a histogram
Slides from Cohen & McCallum
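A sketch of that scoring rule, with hypothetical per-position probabilities standing in for the boosted START/END detectors and a length histogram from training data:

    def extraction_score(start_p, end_p, len_p, i, j):
        """BWI-style score for extracting the span [i, j):
        START(i) * END(j) * LEN(j - i)."""
        return start_p[i] * end_p[j] * len_p.get(j - i, 0.0)

    start_p = {3: 0.8, 7: 0.1}        # Prob(position i starts a field)
    end_p = {6: 0.7, 9: 0.2}          # Prob(position j ends a field)
    len_p = {2: 0.3, 3: 0.5, 4: 0.2}  # length histogram
    best = max(((i, j) for i in start_p for j in end_p if j > i),
               key=lambda ij: extraction_score(start_p, end_p, len_p, *ij))
    print(best)  # (3, 6): the highest-scoring start/end pair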
BWI: Learning to detect boundaries
• BWI uses boosting to find “detectors” for
START and END
• Each weak detector has a BEFORE and
AFTER pattern (on tokens before/after
position i).
• Each “pattern” is a sequence of
– tokens and/or
– wildcards like: anyAlphabeticToken, anyNumber, …
• Weak learner for “patterns” uses greedy
search (+ lookahead) to repeatedly extend a
pair of empty BEFORE,AFTER patterns
Slides from Cohen & McCallum
BWI: Learning to detect boundaries
Field         F1 (Naïve Bayes)   F1 (BWI)
Person Name:  30%                69.3%
Location:     61%                75.5%
Start Time:   98%                99.6%
Slides from Cohen & McCallum
Problems with Sliding Windows
and Boundary Finders
• Decisions in neighboring parts of the input
are made independently from each other.
– Naïve Bayes Sliding Window may predict a
“seminar end time” before the “seminar start time”.
– It is possible for two overlapping windows to both
be above threshold.
– In a Boundary-Finding system, left boundaries are
laid down independently from right boundaries,
and their pairing happens as a separate step.
Slides from Cohen & McCallum
Rule Learning
• Bootstrap
– <class name, plural> “such as” <proper NP>+
– Use the extracted names to label other pages
– Now have training data!
– Learn more rules (a sketch of the pattern match follows):
– <proper NP>, mayor of <proper NP>
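A sketch of the bootstrap pattern match in Python. The class name “cities” and the sentence are illustrative only:

    import re

    # "<class name, plural> such as <proper NP>+" (a Hearst-style pattern)
    PATTERN = re.compile(
        r"\b([a-z]+s)\s+such\s+as\s+"
        r"((?:[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*(?:\s*,\s*|\s+and\s+)?)+)")

    text = "He visited cities such as Seattle, Portland and San Francisco."
    for cls, nps in PATTERN.findall(text):
        names = [n for n in re.split(r"\s*,\s*|\s+and\s+", nps) if n]
        print(cls, names)  # cities ['Seattle', 'Portland', 'San Francisco']
    # These names can now label other pages, yielding training data from
    # which further rules ("<proper NP>, mayor of <proper NP>") are learned.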
Hidden Markov Models
Conditional Random Fields
Finite State Machines
Abraham Lincoln was born in Kentucky.
Most likely state sequence?
References
• [Bikel et al 1997] Bikel, D.; Miller, S.; Schwartz, R.; and Weischedel, R. Nymble: a high-performance learning name-finder. In Proceedings of ANLP’97, p194-201.
• [Califf & Mooney 1999] Califf, M.E.; Mooney, R.: Relational Learning of Pattern-Match Rules for Information Extraction, in Proceedings of the Sixteenth National Conference on Artificial Intelligence (AAAI-99).
• [Cohen, Hurst, Jensen, 2002] Cohen, W.; Hurst, M.; Jensen, L.: A flexible learning system for wrapping tables and lists in HTML documents. Proceedings of The Eleventh International World Wide Web Conference (WWW-2002).
• [Cohen, Kautz, McAllester 2000] Cohen, W.; Kautz, H.; McAllester, D.: Hardening soft information sources. Proceedings of the Sixth International Conference on Knowledge Discovery and Data Mining (KDD-2000).
• [Cohen, 1998] Cohen, W.: Integration of Heterogeneous Databases Without Common Domains Using Queries Based on Textual Similarity, in Proceedings of ACM SIGMOD-98.
• [Cohen, 2000a] Cohen, W.: Data Integration using Similarity Joins and a Word-based Information Representation Language, ACM Transactions on Information Systems, 18(3).
• [Cohen, 2000b] Cohen, W.: Automatically Extracting Features for Concept Learning from the Web, Machine Learning: Proceedings of the Seventeenth International Conference (ML-2000).
• [Collins & Singer 1999] Collins, M.; and Singer, Y. Unsupervised models for named entity classification. In Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, 1999.
• [De Jong 1982] De Jong, G. An Overview of the FRUMP System. In: Lehnert, W. & Ringle, M. H. (eds), Strategies for Natural Language Processing. Lawrence Erlbaum, 1982, 149-176.
• [Freitag 98] Freitag, D.: Information extraction from HTML: application of a general machine learning approach, Proceedings of the Fifteenth National Conference on Artificial Intelligence (AAAI-98).
• [Freitag, 1999] Freitag, D. Machine Learning for Information Extraction in Informal Domains. Ph.D. dissertation, Carnegie Mellon University.
• [Freitag 2000] Freitag, D.: Machine Learning for Information Extraction in Informal Domains, Machine Learning 39(2/3): 99-101 (2000).
• [Freitag & Kushmerick, 2000] Freitag, D.; Kushmerick, N.: Boosted Wrapper Induction. Proceedings of the Seventeenth National Conference on Artificial Intelligence (AAAI-2000).
• [Freitag & McCallum 1999] Freitag, D. and McCallum, A. Information extraction using HMMs and shrinkage. In Proceedings AAAI-99 Workshop on Machine Learning for Information Extraction. AAAI Technical Report WS-99-11.
• [Kushmerick, 2000] Kushmerick, N.: Wrapper Induction: efficiency and expressiveness, Artificial Intelligence, 118: 15-68.
• [Lafferty, McCallum & Pereira 2001] Lafferty, J.; McCallum, A.; and Pereira, F., Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data, In Proceedings of ICML-2001.
• [Leek 1997] Leek, T. R. Information extraction using hidden Markov models. Master’s thesis. UC San Diego.
• [McCallum, Freitag & Pereira 2000] McCallum, A.; Freitag, D.; and Pereira, F., Maximum entropy Markov models for information extraction and segmentation, In Proceedings of ICML-2000.
• [Miller et al 2000] Miller, S.; Fox, H.; Ramshaw, L.; Weischedel, R. A Novel Use of Statistical Parsing to Extract Information from Text. Proceedings of the 1st Annual Meeting of the North American Chapter of the ACL (NAACL), p. 226-233.
Slides from Cohen & McCallum
Introduction to NLP
Oren Etzioni
(CSE 473)
References
• Russell & Norvig ch. 22 & 23
• Daniel Jurafsky and James H. Martin, Speech
and Language Processing
• Foundations of Statistical NLP (Manning and
Schutze)
What is NLP?
• Natural Language Processing
– Understand human language (speech, text)
– Successfully communicate (NL generation)
• Pragmatics: “Can you pass the salt?”
• Semantics: “Paris is the capital of France”
– “Snow is white” is true iff what holds in the world?
– What does ‘water’ refer to? On ‘twin earth’?
• Syntax: analyze the structure of a sentence
• Morphology: analyze the structure of words
– Dogs = dog + s.
Why is Natural Language Understanding
difficult?
• The hidden structure of language is highly
ambiguous
• Structures for: Fed raises interest rates 0.5%
in effort to control inflation (NYT headline 5/17/00)
Where are the ambiguities?
Challenges in NLP: Ambiguity
• Lexical: label (noun or verb)?
– London (Jack or capital of the UK)?
• Syntactic:
– I smelled a wumpus in (2,2)
– Teacher Strikes Idle Kids
– Killer Sentenced to Die for Second Time in 10
Years
• Semantic ambiguity: “dog lover”
• Rampant metaphors: “prices went up”
Pragmatics
How do you map an utterance to the intended
meaning?
• Do you have a standard set of
transformations?
• Do you have a model of beliefs & agents?
– He knows that I can pass the salt, therefore…
• Can this be formulated as a learning
problem?
Parsing
• Analyze the structure of a sentence
[S [NP [D The] [N student]] [VP [V put] [NP [D the] [N book]] [PP [P on] [NP [D the] [N table]]]]]
NLP Application Case Study: Machine
Translation
• Focus is on translating single sentences
using a probabilistic model
• Classical case of statistical NLP (both good &
bad)
• Source sentence is E (English)
• Target sentence is F (French)
• What is P(F|E)?
Noisy Channel Model
• argmax over F of P(F|E)
• P(F|E) = [P(E|F) * P(F)] / P(E)
• Ignore P(E). Why? (It is the same for every candidate F.)
• P(F) = language model of French, acquired from a French corpus
• P(E|F) = “translation model”, acquired from a sentence-“aligned” English-French corpus
• What’s the point? Instead of estimating P(F|E) directly, we decompose the problem into:
– find possible translations of E into French
– rank them using a model of French
Language Models
• To get a probability distribution over sentences, build
an N-gram model
P(f_1 … f_n) ≈ ∏_i P(f_i | f_{i-1})
• P(go | let’s) is high but P(let’s | go) is low
• The equation shows a bigram model; trigrams are common
• N is a tradeoff based on corpus size, sparsity
• Used widely in NLP (e.g., speech)
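A minimal bigram model sketch in Python, estimating P(w_i | w_{i-1}) from counts with add-one smoothing (the tiny corpus is illustrative only):

    from collections import Counter

    corpus = "let's go home let's eat out let's go now".split()
    bigrams = Counter(zip(corpus, corpus[1:]))
    unigrams = Counter(corpus)
    vocab = len(unigrams)

    def p(word, prev):
        # Add-one smoothed conditional probability P(word | prev)
        return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab)

    print(p("go", "let's"))  # ~0.33: "let's go" is frequent
    print(p("let's", "go"))  # ~0.13: "go let's" never occurs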
Translation Model P(E/F)
Relies on corpus where French sentences paired with their English
translations (E.g., European parliament proceedings)
• Corpus too sparse to translate whole sentences
• So again, decompose into N-grams
P(the dog | le chien) = P(the | le) * P(dog | chien)
• How does this fail?
– word order: brown dog → chien brun
– one-to-many: house → à la maison
– idioms (he “kicked the bucket”), metaphors, etc.
• How does this scale to 49,000,000 language pairs?
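A toy noisy-channel sketch tying the two models together: score candidate French translations F of an English sentence E by P(E|F) * P(F). Both probability tables are hypothetical stand-ins for models estimated from real corpora:

    trans = {("the", "le"): 0.9, ("dog", "chien"): 0.8, ("dog", "brun"): 0.01}
    lm = {("<s>", "le"): 0.4, ("le", "chien"): 0.2, ("le", "brun"): 0.05}

    def score(e_words, f_words):
        p_e_f = 1.0
        for e, f in zip(e_words, f_words):    # naive one-to-one alignment
            p_e_f *= trans.get((e, f), 1e-6)  # translation model P(E|F)
        p_f = 1.0
        for prev, w in zip(["<s>"] + f_words, f_words):
            p_f *= lm.get((prev, w), 1e-6)    # bigram language model P(F)
        return p_e_f * p_f

    E = ["the", "dog"]
    candidates = [["le", "chien"], ["le", "brun"]]
    print(max(candidates, key=lambda F: score(E, F)))  # ['le', 'chien']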
Most statistical machine translation research
has focused on a few high-resource languages
(European, Chinese, Japanese, Arabic).
Approximate parallel text available (with English):
• Chinese, French, Arabic: ~200M words
• Various Western European languages (Italian, Danish, Finnish, …): ~30M words (parliamentary proceedings, govt documents)
• Serbian, Uzbek, Bengali, …: ~1M words (Bible/Koran/Book of Mormon/Dianetics)
• Chechen, Khmer, …: ~1K words (nothing / Universal Declaration of Human Rights)
Conclusions
• NLP is harder than it might seem naively
• Many subtasks
• Statistical NLP is the dominant paradigm
– supervised learning
– corpus-based statistics (language models)
– Some important limitations in practice!
• NL “understanding” has received very little
attention
Why is NLP Hard?
• Ambiguity
– Prepositional Phrase Attachment
• Newspaper headline examples
• Ban on Nude Dancing on Governor’s Desk
– Word Sense Disambiguation
• 1990: Iraqi Head Seeking Arms
– Syntactic Ambiguity (what’s modifying what)
• Juvenile Court to Try Shooting Defender.
Examples From Chris Manning
• More:
– Teachers Strikes Idle Kids
– Stolen Painting Found by Tree
– Local High School Dropouts Cut in Half
– Red Tape Holds Up New Bridges
– Clinton Wins on Budget, But More Lies Ahead
– Hospitals are Sued by Seven Foot Doctors
– Kids Make Nutritious Snacks
– Minister Accused of Having Eight Wives in Jail
• The reason that these work on human beings is that natural language understanding essentially depends on making complex and subtle use of context: both the context of the rest of the sentence, and the context of prior discourse and things around you in the world. So for humans it’s obvious what people are saying.
• AI complete tasks
Chart Parsing
• R&N p 800
Small devices
• With a big monitor, humans can
scan for the right information
• On a small screen, there’s
hugely more value from a
system that can show you what
you want:
– phone number
– business hours
– email summary
• “Call me at 11 to finalize this”
NLP for IR/web search?
• It’s a no-brainer that NLP should be useful
and used for web search (and IR in general):
– Search for ‘Jaguar’
• the computer should know or ask whether you’re
interested in big cats [scarce on the web], cars, or,
perhaps a molecule geometry and solvation energy
package, or a package for fast network I/O in Java
– Search for ‘Michael Jordan’
• The basketballer or the machine learning guy?
– Search for laptop, don’t find notebook
– Google doesn’t even stem:
• Search for probabilistic model, and you don’t even match
pages with probabilistic models.
NLP for IR/web search?
• Word sense disambiguation technology
generally works well (like text categorization)
• Synonyms can be found or listed
• Lots of people have been into fixing this
– e-Cyc had a beta version with Hotbot that
disambiguated senses, and was going to go live in
2 months … 14 months ago
– Lots of startups:
• LingoMotors
• iPhrase “Traditional keyword search technology is
hopelessly outdated”
NLP for IR/web search?
• But in practice it’s an idea that hasn’t gotten
much traction
– Correctly finding linguistic base forms is
straightforward, but produces little advantage over
crude stemming, which just slightly over-equivalence-classes words
– Word sense disambiguation only helps on average
in IR if over 90% accurate (Sanderson 1994), and
that’s about where we are
– Syntactic phrases should help, but people have
been able to get most of the mileage with
“statistical phrases” – which have been
aggressively integrated into systems recently
NLP for IR/web search?
• People can easily scan among results (on
their 21” monitor) … if you’re above the fold
• Much more progress has been made in link
analysis, and use of anchor text, etc.
• Anchor text gives human-provided synonyms
• Link or click stream analysis gives a form of
pragmatics: what do people find correct or
important (in a default context)
• Focus on short, popular queries, news, etc.
• Using human intelligence always beats
artificial intelligence
NLP for IR/web search?
• Methods which make use of rich ontologies, etc.,
can work very well for intranet search within a
customer’s site (where anchor-text, link, and
click patterns are much less relevant)
– But don’t really scale to the whole web
• Moral: it’s hard to beat keyword search for the
task of general ad hoc document retrieval
• Conclusion: one should move up the food
chain to tasks where finer grained
understanding of meaning is needed
Product information
Product info
• C-net markets
this information
• How do they get
most of it?
– Phone calls
– Typing.
Inconsistency: digital cameras
• Image Capture Device: 1.68 million pixel 1/2-inch CCD
sensor
• Image Capture Device Total Pixels Approx. 3.34 million
Effective Pixels Approx. 3.24 million
• Image sensor Total Pixels: Approx. 2.11 million-pixel
• Imaging sensor Total Pixels: Approx. 2.11 million 1,688
(H) x 1,248 (V)
• CCD Total Pixels: Approx. 3,340,000 (2,140[H] x 1,560
[V] )
– Effective Pixels: Approx. 3,240,000 (2,088 [H] x 1,550 [V] )
– Recording Pixels: Approx. 3,145,000 (2,048 [H] x 1,536 [V] )
• These all came off the same manufacturer’s
website!!
• And this is a very technical domain. Try sofa beds.
Product information/ Comparison
shopping, etc.
• Need to learn to extract info from online
vendors
• Can exploit uniformity of layout, and (partial)
knowledge of domain by querying with known
products
• E.g., Jango Shopbot (Etzioni and Weld)
– Gives convenient aggregation of online content
• Bug: not popular with vendors
– A partial solution is for these tools to be personal
agents rather than web services
Question answering from text
• TREC 8/9 QA competition: an idea originating
from the IR community
• With massive collections of on-line documents,
manual translation of knowledge is impractical:
we want answers from textbases [cf. bioinformatics]
• Evaluated output is 5 answers of 50/250 byte
snippets of text drawn from a 3 GB text collection,
and required to contain at least one concept of
the semantic category of the expected answer
type. (IR think. Suggests the use of named entity
recognizers.)
• Get reciprocal points for highest correct answer.
Pasca and Harabagiu (2000) show
value of sophisticated NLP
• Good IR is needed: paragraph retrieval based
on SMART
• Large taxonomy of question types and
expected answer types is crucial
• Statistical parser (modeled on Collins 1997)
used to parse questions and relevant text for
answers, and to build knowledge base
• Controlled query expansion loops
(morphological, lexical synonyms, and
semantic relations) are all important
• Answer ranking by simple ML method
Question Answering Example
• How hot does the inside of an active volcano
get?
• get(TEMPERATURE, inside(volcano(active)))
• “lava fragments belched out of the mountain
were as hot as 300 degrees Fahrenheit”
• fragments(lava, TEMPERATURE(degrees(300)),
belched(out, mountain))
– volcano ISA mountain
– lava ISPARTOF volcano → lava inside volcano
– fragments of lava HAVEPROPERTIESOF lava
• The needed semantic information is in WordNet
definitions, and was successfully translated into a
Conclusion
• Complete human-level natural language
understanding is still a distant goal
• But there are now practical and usable partial
NLU systems applicable to many problems
• An important design decision is in finding an
appropriate match between (parts of) the
application domain and the available methods
• But, used with care, statistical NLP methods
have opened up new possibilities for high
performance text understanding systems.
NLP Research at the Turing
Center
Oren Etzioni
Turing Center
University of Washington
http://turing.cs.washington.edu
Turing Center Foci
• Scale MT to 49,000,000 language pairs
– 2,500,000-word translation graph
– P(V → F → C)?
– PanImages
• Accumulate knowledge from the Web
• A new paradigm for Web Search
Outline
1. A New Paradigm for Search
2. Open Information Extraction
3. Tractable Inference
4. Conclusions
Web Search in 2020?
• Type key words into a search box?
• Social or “human powered” Search?
• The Semantic Web?
• What about our technology exponentials?
“The best way to predict the future
is to invent it!”
Intelligent Search
Instead of merely retrieving Web pages, read ‘em!
Machine Reading = Information Extraction (IE) +
tractable inference
• IE(sentence) = who did what?
– speaker(Alon Halevy, UW)
• Inference = uncover implicit information
– Will Alon visit Seattle?
Application: Information Fusion
• What kills bacteria?
• What west coast, nano-technology
companies are hiring?
• Compare Obama’s “buzz” versus Hillary’s?
• What is a quiet, inexpensive, 4-star hotel in
Vancouver?
Opinion Mining
• Opine (Popescu & Etzioni, EMNLP ’05)
• IE(product reviews)
– Informative
– Abundant, but varied
– Textual
• Summarize reviews without any prior
knowledge of product category
But “Reading” the Web is Tough
• Traditional IE is narrow
• IE has been applied to small, homogeneous
corpora
• No parser achieves high accuracy
• No named-entity taggers
• No supervised learning
How about semi-supervised learning?
Semi-Supervised Learning
• Few hand-labeled examples per concept!
⇒ Limit on the number of concepts
⇒ Concepts are pre-specified
⇒ Problematic for the Web
• Alternative: self-supervised learning
– Learner discovers concepts on the fly
– Learner automatically labels examples
2. Open IE = Self-supervised IE
(Banko, Cafarella, Soderland, et al., IJCAI ’07)
                Traditional IE                 Open IE
Input:          Corpus + hand-labeled data     Corpus
Relations:      Specified in advance           Discovered automatically
Complexity:     O(D * R), for R relations      O(D), for D documents
Text analysis:  Parser + named-entity tagger   NP Chunker
Extractor Overview (Banko & Etzioni, ’08)
1. Use a simple model of relationships in English to label extractions
2. Bootstrap a general model of relationships in English sentences, encoded as a CRF
3. Decompose each sentence into one or more (NP1, VP, NP2) “chunks”
4. Use the CRF model to retain relevant parts of each NP and VP.
The extractor is relation-independent!
TextRunner Extraction
• Extract Triple representing binary relation
(Arg1, Relation, Arg2) from sentence.
Internet powerhouse, EBay, was originally founded by Pierre Omidyar.
(Ebay, Founded by, Pierre Omidyar)
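A toy sketch of the extraction in Python. TextRunner actually uses a CRF over NP-chunked sentences; the regex below merely illustrates the (Arg1, Relation, Arg2) shape on this one sentence:

    import re

    sentence = ("Internet powerhouse, EBay, was originally founded "
                "by Pierre Omidyar.")
    m = re.search(r"(\b[A-Z]\w+\b),? (?:was \w+ly )?(founded by) "
                  r"([A-Z]\w+ [A-Z]\w+)", sentence)
    if m:
        print((m.group(1), m.group(2), m.group(3)))
        # ('EBay', 'founded by', 'Pierre Omidyar')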
Numerous Extraction Challenges
• Drop non-essential info:
“was originally founded by”  founded by
• Retain key distinctions
Ebay founded by Pierre ≠ Ebay founded Pierre
• Non-verb relationships
“George Bush, president of the U.S…”
• Synonymy & aliasing
Albert Einstein = Einstein ≠ Einstein Bros.
TextRunner
(Web’s 1st Open IE system)
1. Self-Supervised Learner: automatically labels
example extractions & learns an extractor
2. Single-Pass Extractor: single pass over corpus,
identifying extractions in each sentence
3. Query Processor: indexes extractions & enables
queries at interactive speeds
TextRunner Demo
Sample of 9 million Web Pages
→ Triples: 11.3 million
→ With Well-Formed Relation: 9.3 million
→ With Well-Formed Entities: 7.8 million, of which:
• Abstract facts, e.g. (fruit, contain, vitamins): 6.8 million, 79.2% correct
• Concrete facts, e.g. (Oppenheimer, taught at, Berkeley): 1.0 million, 88.1% correct
3. Tractable Inference
Much of textual information is implicit
I. Entity and predicate resolution
II. Probability of correctness
III. Composing facts to draw conclusions
I. Entity Resolution
Resolver (Yates & Etzioni, HLT ’07): determines synonymy based on relations found by TextRunner (cf. Pantel & Lin ‘01)
• (X, born in, 1941) and (M, born in, 1941)
• (X, citizen of, US) and (M, citizen of, US)
• (X, friend of, Joe) and (M, friend of, Mary)
P(X = M) ~ shared relations
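A bare-bones sketch of that intuition: score a candidate synonym pair by the overlap of its (relation, argument) contexts. Resolver’s actual probabilistic model is richer; the facts below are toy data:

    facts = {
        "X": {("born in", "1941"), ("citizen of", "US"), ("friend of", "Joe")},
        "M": {("born in", "1941"), ("citizen of", "US"), ("friend of", "Mary")},
    }

    shared = facts["X"] & facts["M"]
    score = len(shared) / len(facts["X"] | facts["M"])  # Jaccard-style overlap
    print(score)  # 0.5: evidence that X and M name the same entity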
Relation Synonymy
R:  (1, R, 2), (2, R, 4), (4, R, 8), etc.
R’: (1, R’, 2), (2, R’, 4), (4, R’, 8), etc.
P(R = R’) ~ shared argument pairs
• Unsupervised probabilistic model
• O(N log N) algorithm run on millions of docs
II. Probability of Correctness
How likely is an extraction to be correct?
Factors to consider include:
• Authoritativeness of source
• Confidence in extraction method
• Number of independent extractions
III. Compositional Inference
(work in progress: Schoenmackers, Etzioni, Weld)
Implicit information (2 + 2 = 4):
• TextRunner: (Turing, born in, London)
• WordNet: (London, part of, England)
• Rule: ‘born in’ is transitive through ‘part of’
• Conclusion: (Turing, born in, England)
• Mechanism: MLN instantiated on the fly
• Rules: learned from corpus (future work)
• Inference Demo
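A tiny deterministic sketch of the composition step, joining TextRunner facts with WordNet facts under the rule born-in(x, y) AND part-of(y, z) => born-in(x, z). The real system instantiates a Markov Logic Network on the fly; this join only illustrates the inference pattern:

    textrunner = [("Turing", "born in", "London")]
    wordnet = [("London", "part of", "England")]

    derived = [(x, "born in", z)
               for (x, r1, y) in textrunner if r1 == "born in"
               for (y2, r2, z) in wordnet if r2 == "part of" and y2 == y]
    print(derived)  # [('Turing', 'born in', 'England')]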
KnowItAll Family Tree
Mulder ‘01
WebKB ‘99
PMI-IR ‘01
KnowItAll, ‘04
Opine ‘05
BE ‘05
Woodward ‘06
Resolver ‘07
REALM ‘07
Urns
KnowItNow ‘05
TextRunner ‘07
Inference ‘08
KnowItAll Team
• Michele Banko
• Michael Cafarella
• Doug Downey
• Alan Ritter
• Dr. Stephen Soderland
• Stefan Schoenmackers
• Prof. Dan Weld
• Mausam
• Alumni: Dr. Ana-Maria
Popescu, Dr. Alex Yates,
and others.
4. Conclusions
Imagine search systems that operate over a (more)
semantic space
• Key words, documents → extractions
• TF-IDF, PageRank → relational models
• Web pages, hyperlinks → entities, relations
Reading the Web → a new Search Paradigm