Sliding Window


Information Extraction from the World Wide Web
CSE 454
Based on Slides by
William W. Cohen
Carnegie Mellon University
Andrew McCallum
University of Massachusetts Amherst
From KDD 2003
todo
• Add explanation of top down rule learning
• Add explanation of wrappers for BWI
• Better example of HMM structure for cora DB
• Fix slide 79
• Review learning with complete info
Group Meetings
• Thursday ??
• Tuesday
Class Overview
Info Extraction
Self-Supervised
HMMs
Sliding Windows
Query processing
Indexing
IR - Ranking
Content Analysis
Crawling
Network Layer
Information Extraction
Example: The Problem
Martin Baker, a person
Genomics job
Employer's job posting form
Slides from Cohen & McCallum
Example: A Solution
Slides from Cohen & McCallum
Extracting Job Openings from the Web
foodscience.com-Job2
JobTitle: Ice Cream Guru
Employer: foodscience.com
JobCategory: Travel/Hospitality
JobFunction: Food Services
JobLocation: Upper Midwest
Contact Phone: 800-488-2611
DateExtracted: January 8, 2001
Source: www.foodscience.com/jobs_midwest.htm
OtherCompanyJobs: foodscience.com-Job1
Slides from Cohen & McCallum
Slides from Cohen & McCallum
Category = Food Services
Keyword = Baker
Location = Continental U.S.
Job Openings:
What is “Information Extraction”
As a task:
Filling slots in a database from sub-segments of text.
October 14, 2002, 4:00 a.m. PT
For years, Microsoft Corporation CEO Bill
Gates railed against the economic philosophy
of open-source software with Orwellian fervor,
denouncing its communal licensing as a
"cancer" that stifled technological innovation.
Today, Microsoft claims to "love" the open-source concept, by which software code is
made public to encourage improvement and
development by outside programmers. Gates
himself says Microsoft will gladly disclose its
crown jewels--the coveted code behind the
Windows operating system--to select
customers.
NAME
TITLE
ORGANIZATION
"We can be open source. We love the concept
of shared source," said Bill Veghte, a
Microsoft VP. "That's a super-important shift
for us in terms of code access."
Richard Stallman, founder of the Free
Software Foundation, countered saying…
Slides from Cohen & McCallum
What is “Information Extraction”
As a task:
Filling slots in a database from sub-segments of text.
October 14, 2002, 4:00 a.m. PT
For years, Microsoft Corporation CEO Bill
Gates railed against the economic philosophy
of open-source software with Orwellian fervor,
denouncing its communal licensing as a
"cancer" that stifled technological innovation.
Today, Microsoft claims to "love" the open-source concept, by which software code is
made public to encourage improvement and
development by outside programmers. Gates
himself says Microsoft will gladly disclose its
crown jewels--the coveted code behind the
Windows operating system--to select
customers.
IE
NAME              TITLE     ORGANIZATION
Bill Gates        CEO       Microsoft
Bill Veghte       VP        Microsoft
Richard Stallman  founder   Free Soft..
"We can be open source. We love the concept
of shared source," said Bill Veghte, a
Microsoft VP. "That's a super-important shift
for us in terms of code access."
Richard Stallman, founder of the Free
Software Foundation, countered saying…
Slides from Cohen & McCallum
What is “Information Extraction”
As a family
of techniques:
Information Extraction =
segmentation + classification + clustering + association
October 14, 2002, 4:00 a.m. PT
For years, Microsoft Corporation CEO Bill
Gates railed against the economic philosophy
of open-source software with Orwellian fervor,
denouncing its communal licensing as a
"cancer" that stifled technological innovation.
Today, Microsoft claims to "love" the open-source concept, by which software code is
made public to encourage improvement and
development by outside programmers. Gates
himself says Microsoft will gladly disclose its
crown jewels--the coveted code behind the
Windows operating system--to select
customers.
"We can be open source. We love the concept
of shared source," said Bill Veghte, a
Microsoft VP. "That's a super-important shift
for us in terms of code access."
Microsoft Corporation
CEO
Bill Gates
Microsoft
Gates
Microsoft
Bill Veghte
Microsoft
VP
Richard Stallman
founder
Free Software Foundation
Richard Stallman, founder of the Free
Software Foundation, countered saying…
Slides from Cohen & McCallum
What is “Information Extraction”
As a family
of techniques:
Information Extraction =
segmentation + classification + association + clustering
October 14, 2002, 4:00 a.m. PT
For years, Microsoft Corporation CEO Bill
Gates railed against the economic philosophy
of open-source software with Orwellian fervor,
denouncing its communal licensing as a
"cancer" that stifled technological innovation.
Today, Microsoft claims to "love" the open-source concept, by which software code is
made public to encourage improvement and
development by outside programmers. Gates
himself says Microsoft will gladly disclose its
crown jewels--the coveted code behind the
Windows operating system--to select
customers.
"We can be open source. We love the concept
of shared source," said Bill Veghte, a
Microsoft VP. "That's a super-important shift
for us in terms of code access."
Microsoft Corporation
CEO
Bill Gates
Microsoft
Gates
Microsoft
Bill Veghte
Microsoft
VP
Richard Stallman
founder
Free Software Foundation
Richard Stallman, founder of the Free
Software Foundation, countered saying…
Slides from Cohen & McCallum
What is “Information Extraction”
As a family
of techniques:
Information Extraction =
segmentation + classification + association + clustering
October 14, 2002, 4:00 a.m. PT
For years, Microsoft Corporation CEO Bill
Gates railed against the economic philosophy
of open-source software with Orwellian fervor,
denouncing its communal licensing as a
"cancer" that stifled technological innovation.
Today, Microsoft claims to "love" the open-source concept, by which software code is
made public to encourage improvement and
development by outside programmers. Gates
himself says Microsoft will gladly disclose its
crown jewels--the coveted code behind the
Windows operating system--to select
customers.
"We can be open source. We love the concept
of shared source," said Bill Veghte, a
Microsoft VP. "That's a super-important shift
for us in terms of code access."
* Microsoft Corporation
CEO
Bill Gates
* Microsoft
Gates
* Microsoft
Bill Veghte
* Microsoft
VP
Richard Stallman
founder
Free Software Foundation
Richard Stallman, founder of the Free
Software Foundation, countered saying…
Slides from Cohen & McCallum
IE in Context
Create ontology
Spider
Filter by relevance
Load DB
IE
Document collection
Database
Query, Search
Data mine
Train extraction models
Label training data
Slides from Cohen & McCallum
IE History
Pre-Web
• Mostly news articles
– De Jong’s FRUMP [1982]
• Hand-built system to fill Schank-style “scripts” from news wire
– Message Understanding Conference (MUC) DARPA [’87-’95], TIPSTER [’92-’96]
• Most early work dominated by hand-built models
– E.g. SRI’s FASTUS, hand-built FSMs.
– But by 1990’s, some machine learning: Soderland/Lehnert, Cardie, Grishman
– Then HMMs: Elkan [Leek ’97], BBN [Bikel et al ’98]
Web
• AAAI ’94 Spring Symposium on “Software Agents”
– Much discussion of ML applied to Web. Maes, Mitchell, Etzioni.
• Tom Mitchell’s WebKB, ‘96
– Build KB’s from the Web.
• Wrapper Induction
– First by hand, then ML: [Doorenbos ‘96], [Soderland ’96], [Kushmerick ’97],…
Slides from Cohen & McCallum
Landscape of IE Tasks (1/4):
Pattern Feature Domain
Text paragraphs
without formatting
Grammatical sentences
and some formatting & links
Astro Teller is the CEO and co-founder of
BodyMedia. Astro holds a Ph.D. in Artificial
Intelligence from Carnegie Mellon University,
where he was inducted as a national Hertz fellow.
His M.S. in symbolic and heuristic computation
and B.S. in computer science are from Stanford
University. His work in science, literature and
business has appeared in international media from
the New York Times to CNN to NPR.
Non-grammatical snippets,
rich formatting & links
Tables
Slides from Cohen & McCallum
Landscape of IE Tasks (2/4):
Pattern Scope
Web site specific
Formatting
Amazon Book Pages
Genre specific
Layout
Resumes
Wide, non-specific
Language
University Names
Slides from Cohen & McCallum
Landscape of IE Tasks (3/4):
Pattern Complexity
E.g. word patterns:
Closed set
Regular set
U.S. states
U.S. phone numbers
He was born in Alabama…
Phone: (413) 545-1323
The big Wyoming sky…
The CALD main office can be
reached at 412-268-1299
Complex pattern
U.S. postal addresses
University of Arkansas
P.O. Box 140
Hope, AR 71802
Headquarters:
1128 Main Street, 4th Floor
Cincinnati, Ohio 45210
Ambiguous patterns,
needing context and
many sources of evidence
Person names
…was among the six houses
sold by Hope Feldman that year.
Pawel Opalinski, Software
Engineer at WhizBang Labs.
Slides from Cohen & McCallum
Landscape of IE Tasks (4/4):
Pattern Combinations
Jack Welch will retire as CEO of General Electric
tomorrow. The top role at the Connecticut company
will be filled by Jeffrey Immelt.
Single entity (“named entity” extraction):
  Person: Jack Welch
  Person: Jeffrey Immelt
  Location: Connecticut
Binary relationship:
  Relation: Person-Title
    Person: Jack Welch
    Title: CEO
  Relation: Company-Location
    Company: General Electric
    Location: Connecticut
N-ary record:
  Relation: Succession
    Company: General Electric
    Title: CEO
    Out: Jack Welch
    In: Jeffrey Immelt
Slides from Cohen & McCallum
Landscape of IE Models
Each family of models classifies the example sentence “Abraham Lincoln was born in Kentucky.”:
• Lexicons: member? (Alabama, Alaska, …, Wisconsin, Wyoming)
• Classify Pre-segmented Candidates: Classifier (which class?)
• Sliding Window: Classifier (which class?), try alternate window sizes
• Boundary Models: BEGIN/END classifiers (which class?)
• Finite State Machines: most likely state sequence?
• Context Free Grammars: most likely parse? (NNP NNP V V P; NP, PP, VP, S)
…and beyond
Any of these models can be used to capture words, formatting, or both.
Slides from Cohen & McCallum
Evaluation of Single Entity Extraction
TRUTH:
Michael Kearns and Sebastian Seung will start Monday’s tutorial, followed by Richard M. Karpe and Martin Cooke.
PRED:
Michael Kearns and Sebastian Seung will start Monday’s tutorial, followed by Richard M. Karpe and Martin Cooke.
Precision = (# correctly predicted segments) / (# predicted segments) = 2/6
Recall    = (# correctly predicted segments) / (# true segments)      = 2/4
F1 = harmonic mean of Precision & Recall = 1 / ((1/P + 1/R) / 2)
Slides from Cohen & McCallum
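These metrics can be sketched in a few lines of Python; the (start, end) span representation and the toy numbers below are illustrative, not from the slides:

def evaluate(predicted_segments, true_segments):
    """Segment-level precision, recall, and F1.
    A predicted segment counts as correct only if it exactly matches a true segment."""
    predicted, true = set(predicted_segments), set(true_segments)
    correct = len(predicted & true)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(true) if true else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy spans matching the slide's counts: 2 correct, 6 predicted, 4 true.
pred = {(0, 2), (3, 4), (5, 6), (7, 8), (9, 10), (11, 12)}
gold = {(0, 2), (3, 4), (20, 22), (30, 31)}
print(evaluate(pred, gold))   # -> (0.333..., 0.5, 0.4)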
State of the Art Performance
• Named entity recognition
– Person, Location, Organization, …
– F1 in high 80’s or low- to mid-90’s
• Binary relation extraction
– Contained-in (Location1, Location2)
Member-of (Person1, Organization1)
– F1 in 60’s or 70’s or 80’s
• Wrapper induction
– Extremely accurate performance obtainable
– Human effort (~30min) required on each site
Slides from Cohen & McCallum
Current Research
• Open Information Extraction
• Large Scale Extraction
– E.g., 5000 relations
– Distant Supervision
• Rapid Supervision
– Minimal labeled examples
– Minimal tuning
Landscape:
Focus of this Tutorial
Pattern complexity:     closed set → regular → complex → ambiguous
Pattern feature domain: words → words + formatting → formatting
Pattern scope:          site-specific → genre-specific → general
Pattern combinations:   entity → binary → n-ary
Models:                 lexicon → regex → window → boundary → FSM → CFG
Slides from Cohen & McCallum
Quick Review
Bayes Theorem
1702-1761
P(H | E) = P(E | H) P(H) / P(E)
Bayesian Categorization
• Let set of categories be {c1, c2,…cn}
• Let E be description of an instance.
• Determine category of E by determining for each ci
P(ci | E) = P(ci) P(E | ci) / P(E)
• P(E) can be determined since categories are complete
and disjoint.
Σ_{i=1..n} P(ci | E) = Σ_{i=1..n} P(ci) P(E | ci) / P(E) = 1
P(E) = Σ_{i=1..n} P(ci) P(E | ci)
Naïve Bayesian Motivation
• Problem: Too many possible instances (exponential in
m) to estimate all P(E | ci)
• If we assume features of an instance are independent
given the category (ci) (conditionally independent).
P(E | ci) = P(e1 ∧ e2 ∧ … ∧ em | ci) = Π_{j=1..m} P(ej | ci)
• Therefore, we then only need to know P(ej | ci) for each feature and category.
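A minimal sketch of the resulting classifier in Python, choosing argmax_ci of P(ci) Π_j P(ej | ci) in log space; the probability tables below are made-up examples, not learned values:

import math

# Made-up parameters for two categories: P(ci) and P(ej | ci).
prior = {"spam": 0.3, "ham": 0.7}
cond = {
    "spam": {"nigeria": 0.20, "widow": 0.10, "cse454": 0.001},
    "ham":  {"nigeria": 0.001, "widow": 0.01, "cse454": 0.05},
}

def classify(features):
    """Return argmax_ci of log P(ci) + sum_j log P(ej | ci)."""
    best, best_score = None, float("-inf")
    for c in prior:
        score = math.log(prior[c])
        for e in features:
            score += math.log(cond[c].get(e, 1e-6))   # tiny floor for unseen features
        if score > best_score:
            best, best_score = c, score
    return best

print(classify(["nigeria", "widow"]))   # -> 'spam'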
Probabilistic Graphical Models
• Nodes = Random Variables
• Directed Edges = Causal Connections
– Associated with a CPT (conditional probability table)
Hidden state (Boolean random variable): y = Spam?
Causal dependencies (probabilistic): P(xi | y=spam), P(xi | y≠spam)
Observable Boolean random variables: x1 = Nigeria?, x2 = Widow?, x3 = CSE 454?
Recap: Naïve Bayes
• Assumption: features independent given label
• Generative Classifier
– Model joint distribution p(x,y)
– Inference: predict argmax_y p(x, y)
– Learning: counting
– Can we use for IE directly?
The article appeared in the Seattle Times.
(city?)
Word features: capitalization, length, suffix
Labels of neighboring words are dependent! Need to consider the sequence!
Sliding Windows
Slides from Cohen & McCallum
Extraction by Sliding Window
GRAND CHALLENGES FOR MACHINE LEARNING
Jaime Carbonell
School of Computer Science
Carnegie Mellon University
E.g.
Looking for
seminar
location
3:30 pm
7500 Wean Hall
Machine learning has evolved from obscurity
in the 1970s into a vibrant and popular
discipline in artificial intelligence
during the 1980s and 1990s.
As a result
of its success and growth, machine learning
is evolving into a collection of related
disciplines: inductive concept acquisition,
analytic learning in problem solving (e.g.
analogy, explanation-based learning),
learning theory (e.g. PAC learning),
genetic algorithms, connectionist learning,
hybrid systems, and so on.
CMU UseNet Seminar Announcement
Slides from Cohen & McCallum
A “Naïve Bayes” Sliding Window Model
[Freitag 1997]
…
00 : pm Place : Wean Hall Rm 5409 Speaker : Sebastian Thrun …
prefix:   w_{t-m} … w_{t-1}
contents: w_t … w_{t+n}
suffix:   w_{t+n+1} … w_{t+n+m}
Estimate Pr(LOCATION | window) using Bayes rule
Try all “reasonable” windows (vary length, position)
Assume independence for length, prefix, suffix, content words
Estimate from data quantities like: Pr(“Place” in prefix|LOCATION)
If P(“Wean Hall Rm 5409” = LOCATION) is above some threshold, extract it.
Other examples of sliding window: [Baluja et al 2000]
(decision tree over individual words & their context)
Slides from Cohen & McCallum
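A rough sketch of such a window scorer in Python; the model tables, window limits, and threshold are all hypothetical placeholders, not values from Freitag's system:

import math

def window_score(tokens, start, end, model, prefix_len=2, suffix_len=2):
    """log P(LOCATION | window), up to a constant, under the slide's independence
    assumptions: separate terms for length, prefix words, content words, suffix words.
    `model` holds smoothed estimates, e.g. model["prefix"]["Place"] for
    Pr("Place" in prefix | LOCATION)."""
    prefix = tokens[max(0, start - prefix_len):start]
    contents = tokens[start:end]
    suffix = tokens[end:end + suffix_len]
    score = math.log(model["length"].get(len(contents), 1e-6))
    for part, words in (("prefix", prefix), ("contents", contents), ("suffix", suffix)):
        for w in words:
            score += math.log(model[part].get(w, 1e-6))
    return score

def extract_locations(tokens, model, max_len=5, threshold=-25.0):
    """Try all "reasonable" windows (vary position and length); keep those above threshold."""
    spans = []
    for start in range(len(tokens)):
        for end in range(start + 1, min(start + 1 + max_len, len(tokens) + 1)):
            if window_score(tokens, start, end, model) > threshold:
                spans.append((start, end))
    return spans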
“Naïve Bayes” Sliding Window Results
Domain: CMU UseNet Seminar Announcements
GRAND CHALLENGES FOR MACHINE LEARNING
Jaime Carbonell
School of Computer Science
Carnegie Mellon University
3:30 pm
7500 Wean Hall
Machine learning has evolved from obscurity
in the 1970s into a vibrant and popular
discipline in artificial intelligence during
the 1980s and 1990s.
As a result of its
success and growth, machine learning is
evolving into a collection of related
disciplines: inductive concept acquisition,
analytic learning in problem solving (e.g.
analogy, explanation-based learning),
learning theory (e.g. PAC learning), genetic
algorithms, connectionist learning, hybrid
systems, and so on.
Field         F1
Person Name:  30%
Location:     61%
Start Time:   98%
Slides from Cohen & McCallum
Realistic sliding-window-classifier IE
• What windows to consider?
– all windows containing as many tokens as the
shortest example, but no more tokens than the
longest example
• How to represent a classifier? It might:
– Restrict the length of window;
– Restrict the vocabulary or formatting used
before/after/inside window;
– Restrict the relative order of tokens, etc.
• Learning Method
– SRV: Top-Down Rule Learning
[Freitag AAAI ‘98]
– Rapier: Bottom-Up
[Califf & Mooney, AAAI ‘99]
Slides from Cohen & McCallum
Rapier: results – precision/recall
Slides from Cohen & McCallum
Rapier – results vs. SRV
Slides from Cohen & McCallum
Rule-learning approaches to sliding-window classification: Summary
• SRV, Rapier, and WHISK [Soderland KDD ‘97]
– Representations for classifiers allow restriction of
the relationships between tokens, etc
– Representations are carefully chosen subsets of
even more powerful representations based on
logic programming (ILP and Prolog)
– Use of these “heavyweight” representations is
complicated, but seems to pay off in results
• Can simpler representations for classifiers
work?
Slides from Cohen & McCallum
BWI: Learning to detect boundaries
[Freitag & Kushmerick, AAAI 2000]
• Another formulation: learn 3 probabilistic classifiers:
– START(i) = Prob( position i starts a field)
– END(j) = Prob( position j ends a field)
– LEN(k) = Prob( an extracted field has length k)
• Score a possible extraction (i,j) by
START(i) * END(j) * LEN(j-i)
• LEN(k) is estimated from a histogram
Slides from Cohen & McCallum
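A minimal sketch of this (i, j) scoring in Python, assuming the START, END, and LEN estimates are already available; the argument names and indexing convention are hypothetical:

def best_extraction(tokens, start_prob, end_prob, length_hist, max_len=10, min_score=0.0):
    """Score every candidate extraction (i, j) as START(i) * END(j) * LEN(j - i),
    as on the slide, and return the highest-scoring one above min_score.
    Here i and j are boundary positions: the field is tokens[i:j], of length j - i.
    start_prob[i]: Prob(position i starts a field)   (boosted START detectors)
    end_prob[j]:   Prob(position j ends a field)     (boosted END detectors)
    length_hist[k]: Prob(an extracted field has length k), from a histogram."""
    best, best_score = None, min_score
    for i in range(len(tokens)):
        for j in range(i + 1, min(i + max_len, len(tokens)) + 1):
            score = start_prob[i] * end_prob[j] * length_hist.get(j - i, 0.0)
            if score > best_score:
                best, best_score = (i, j), score
    return best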
BWI: Learning to detect boundaries
• BWI uses boosting to find “detectors” for
START and END
• Each weak detector has a BEFORE and
AFTER pattern (on tokens before/after
position i).
• Each “pattern” is a sequence of
– tokens and/or
– wildcards like: anyAlphabeticToken, anyNumber, …
• Weak learner for “patterns” uses greedy
search (+ lookahead) to repeatedly extend a
pair of empty BEFORE,AFTER patterns
Slides from Cohen & McCallum
BWI: Learning to detect boundaries
Field         Naïve Bayes F1
Speaker:      30%
Location:     61%
Start Time:   98%
Slides from Cohen & McCallum
Problems with Sliding Windows
and Boundary Finders
• Decisions in neighboring parts of the input
are made independently from each other.
– Naïve Bayes Sliding Window may predict a
“seminar end time” before the “seminar start time”.
– It is possible for two overlapping windows to both
be above threshold.
– In a Boundary-Finding system, left boundaries are
laid down independently from right boundaries,
and their pairing happens as a separate step.
Slides from Cohen & McCallum
Landscape of IE Techniques (1/1): Models
Each family of models classifies the example sentence “Abraham Lincoln was born in Kentucky.”:
• Lexicons: member? (Alabama, Alaska, …, Wisconsin, Wyoming)
• Classify Pre-segmented Candidates: Classifier (which class?)
• Sliding Window: Classifier (which class?), try alternate window sizes
• Boundary Models: BEGIN/END classifiers (which class?)
• Finite State Machines: most likely state sequence?
• Context Free Grammars: most likely parse? (NNP NNP V V P; NP, PP, VP, S)
Each model can capture words, formatting, or both.
Slides from Cohen & McCallum
Finite State Machines
Slides from Cohen & McCallum
Hidden Markov Models (HMMs)
standard sequence model in genomics, speech, NLP, …
Graphical model / finite state model:
  states (transitions):  … S_{t-1}  S_t  S_{t+1} …
  observations:          … O_{t-1}  O_t  O_{t+1} …
Generates:
  State sequence s_1 … s_{|o|}
  Observation sequence o_1 o_2 o_3 o_4 o_5 o_6 o_7 o_8 …

P(s, o) = Π_{t=1..|o|} P(s_t | s_{t-1}) · P(o_t | s_t)

Parameters: for all states S = {s1, s2, …}
  Start state probabilities: P(s_1)
  Transition probabilities: P(st | st-1)
  Observation (emission) probabilities: P(ot | st)   (usually a multinomial over an atomic, fixed alphabet)
Training: maximize probability of training observations (w/ prior)
Slides from Cohen & McCallum
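A small Python sketch of the joint probability above; the parameter dictionaries are assumed to be given, with a start-state distribution standing in for P(s_1):

import math

def log_joint(states, observations, start_p, trans_p, emit_p):
    """log P(s, o) = log P(s_1) + log P(o_1 | s_1)
                     + sum over t > 1 of [ log P(s_t | s_{t-1}) + log P(o_t | s_t) ]."""
    logp = math.log(start_p[states[0]]) + math.log(emit_p[states[0]][observations[0]])
    for t in range(1, len(states)):
        logp += math.log(trans_p[states[t - 1]][states[t]])
        logp += math.log(emit_p[states[t]][observations[t]])
    return logp

With the dishonest-casino parameters on the following slides, for instance, the states would be FAIR/LOADED and the observations die rolls 1 through 6.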
Example: The Dishonest Casino
A casino has two dice:
• Fair die
P(1) = P(2) = P(3) = P(4) = P(5) = P(6) = 1/6
• Loaded die
P(1) = P(2) = P(3) = P(4) = P(5) = 1/10
P(6) = 1/2
Casino player switches back-&-forth
between fair and loaded die once every
20 turns
Game:
1. You bet $1
2. You roll (always with a fair die)
3. Casino player rolls (maybe with fair die,
maybe with loaded die)
4. Highest number wins $2
Slides from Serafim Batzoglou
Question # 1 – Evaluation
GIVEN
A sequence of rolls by the casino player
124552646214614613613666166466163661636616361…
QUESTION
How likely is this sequence, given our model of
how the casino works?
This is the EVALUATION problem in HMMs
Slides from Serafim Batzoglou
Question # 2 – Decoding
GIVEN
A sequence of rolls by the casino player
1245526462146146136136661664661636616366163…
QUESTION
What portion of the sequence was generated
with the fair die, and what portion with the
loaded die?
Slides from Serafim Batzoglou
Question # 3 – Learning
GIVEN
A sequence of rolls by the casino player
124552646214614613613666166466163661636616361651…
QUESTION
How “loaded” is the loaded die? How “fair” is the fair
die? How often does the casino player change from
fair to loaded, and back?
This is the LEARNING question in HMMs
Slides from Serafim Batzoglou
The dishonest casino model
Transitions: P(FAIR→FAIR) = P(LOADED→LOADED) = 0.95;  P(FAIR→LOADED) = P(LOADED→FAIR) = 0.05
FAIR:    P(1|F) = P(2|F) = P(3|F) = P(4|F) = P(5|F) = P(6|F) = 1/6
LOADED:  P(1|L) = P(2|L) = P(3|L) = P(4|L) = P(5|L) = 1/10,  P(6|L) = 1/2
Slides from Serafim Batzoglou
What’s this have to do with Info Extraction?
Transitions: P(stay in same state) = 0.95;  P(switch) = 0.05
FAIR:    P(1|F) = … = P(6|F) = 1/6
LOADED:  P(1|L) = … = P(5|L) = 1/10,  P(6|L) = 1/2
What’s this have to do with Info Extraction?
Transitions: P(stay in same state) = 0.95;  P(switch) = 0.05
TEXT:  P(the | T) = 0.003,  P(from | T) = 0.002,  …
NAME:  P(Dan | N) = 0.005,  P(Sue | N) = 0.003,  …
IE with Hidden Markov Models
Given a sequence of observations:
Yesterday Pedro Domingos spoke this example sentence.
and a trained HMM:
person name
location name
background
Find the most likely state sequence (Viterbi): argmax_s P(s, o)
Yesterday Pedro Domingos spoke this example sentence.
Any words said to be generated by the designated “person name”
state are extracted as a person name:
Person name: Pedro Domingos
Slide by Cohen & McCallum
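A sketch of that last step in Python, assuming the Viterbi state sequence has already been computed; the state names follow the slide:

def extract_fields(tokens, states, target="person name"):
    """Collect maximal runs of tokens whose (Viterbi) state is `target`."""
    fields, current = [], []
    for token, state in zip(tokens, states):
        if state == target:
            current.append(token)
        elif current:
            fields.append(" ".join(current))
            current = []
    if current:
        fields.append(" ".join(current))
    return fields

tokens = "Yesterday Pedro Domingos spoke this example sentence .".split()
states = ["background", "person name", "person name", "background",
          "background", "background", "background", "background"]
print(extract_fields(tokens, states))   # -> ['Pedro Domingos']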
IE with Hidden Markov Models
For sparse extraction tasks :
• Separate HMM for each type of target
• Each HMM should
– Model entire document
– Consist of target and non-target states
– Not necessarily fully connected
Slide by Okan Basegmez
Or … Combined HMM
• Example – Research Paper Headers
Slide by Okan Basegmez
HMM Example: “Nymble”
[Bikel, et al 1998],
[BBN “IdentiFinder”]
Task: Named Entity Extraction
[State diagram: start-of-sentence and end-of-sentence connect Person, Org, (five other name classes), and Other; both the transition probabilities and the observation probabilities have back-off models]
Train on ~500k words of news wire text.
Results:
Case    Language   F1
Mixed   English    93%
Upper   English    91%
Mixed   Spanish    90%
Other examples of shrinkage for HMMs in IE: [Freitag and McCallum ‘99]
Slide by Cohen & McCallum
HMM Example: “Nymble”
Task: Named Entity Extraction
[Bikel, et al 1998],
[BBN “IdentiFinder”]
[State diagram: start-of-sentence and end-of-sentence connect Person, Org, (five other name classes), and Other]
Train on ~500k words of news wire text.
Results:
Case    Language   F1
Mixed   English    93%
Upper   English    91%
Mixed   Spanish    90%
Slide adapted from Cohen & McCallum
Finite State Model vs. Path
[Finite state model: start-of-sentence and end-of-sentence connect Person, Org, (five other name classes), and Other]
[Path: state sequence y1 y2 y3 y4 y5 y6 … emitting observations x1 x2 x3 x4 x5 x6 …]
Question #1 – Evaluation
GIVEN
A sequence of observations x1 x2 x3 x4 ……xN
A trained HMM θ = (start, transition, and emission probabilities)
QUESTION
How likely is this sequence, given our HMM ?
P(x,θ)
Why do we care?
Need it for learning to choose among competing
models!
Question #2 - Decoding
GIVEN
A sequence of observations x1 x2 x3 x4 ……xN
A trained HMM θ = (start, transition, and emission probabilities)
QUESTION
How do we choose the corresponding parse (state sequence) y1 y2
y3 y4 ……yN , which “best” explains x1 x2 x3 x4 ……xN ?
There are several reasonable optimality criteria:
single optimal sequence, average statistics for
individual states, …
A parse of a sequence
Given a sequence x = x1……xN,
A parse of x is a sequence of states y = y1, ……, yN
person, other, location, …
[Trellis: states 1 … K at each position, over observations x1 x2 x3 … xK]
Slide by Serafim Batzoglou
Question #3 - Learning
GIVEN
A sequence of observations x1 x2 x3 x4 ……xN
QUESTION
How do we learn the model parameters θ = (start, transition, and emission
probabilities) which maximize P(x, θ)?
Three Questions
• Evaluation
– Forward algorithm
– (Could also go other direction)
• Decoding
– Viterbi algorithm
• Learning
– Baum-Welch Algorithm (aka “forward-backward”)
• A kind of EM (expectation maximization)
– Perceptron Updates
A Solution to #1: Evaluation
Given observations x=x1 …xN and HMM θ, what is p(x) ?
Enumerate every possible state sequence y=y1 …yN
Probability of x given a particular y, times the probability of that particular y:
P(x) = Σ_y P(x, y) = Σ_y P(x | y) P(y)
About 2T multiplications per sequence, and N^T state sequences to sum over.
For small HMMs (T=10, N=10) there are already 10 billion sequences!
Solution to #1: Evaluation
Use Dynamic Programming:
Define forward variable
probability that at time t
- the state is Si
- the partial observation sequence x=x1 …xt has been emitted
Forward Variable α_t(i)
Prob that the state at time t was S_i and
the partial obs sequence x = x1 … xt has been seen
[Trellis: states (person, other, location, …) 1 … K at each position, over observations x1 x2 x3 … xt …]
Forward Variable α_t(i)
prob that the state at t was S_i and
the partial obs sequence x = x1 … xt has been seen
[Same trellis, with the node for state S_i at time t highlighted]
Solution to #1: Evaluation
• Use Dynamic Programming
• Cache and reuse inner sums
• Define forward variables
probability that at time t
- the state is yt = Si
- the partial observation sequence x=x1 …xt has been emitted
The Forward Algorithm
INITIALIZATION:  α_1(i) = P(s_1 = S_i) · P(x_1 | S_i)
INDUCTION:       α_{t+1}(j) = [ Σ_i α_t(i) · P(S_j | S_i) ] · P(x_{t+1} | S_j)
TERMINATION:     P(x) = Σ_i α_N(i)
Time:  O(K²N)      (K = |S|, the number of states)
Space: O(KN)       (N = length of the sequence)
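A direct Python sketch of this dynamic program, using dictionary-based parameters and the dishonest-casino model from the earlier slides (the 50/50 start distribution is an assumption; there is no log-space or scaling, so it only suits short sequences):

def forward(observations, states, start_p, trans_p, emit_p):
    """alpha[t][i] = P(x_1..x_t, state at time t = i); returns (alpha, P(x))."""
    alpha = [{i: start_p[i] * emit_p[i][observations[0]] for i in states}]  # initialization
    for t in range(1, len(observations)):                                   # induction
        alpha.append({j: emit_p[j][observations[t]] *
                         sum(alpha[t - 1][i] * trans_p[i][j] for i in states)
                      for j in states})
    return alpha, sum(alpha[-1][i] for i in states)                         # termination

# Dishonest casino from the earlier slides (start distribution assumed 50/50).
states = ["F", "L"]
start_p = {"F": 0.5, "L": 0.5}
trans_p = {"F": {"F": 0.95, "L": 0.05}, "L": {"F": 0.05, "L": 0.95}}
emit_p = {"F": {r: 1 / 6 for r in range(1, 7)},
          "L": {1: 0.1, 2: 0.1, 3: 0.1, 4: 0.1, 5: 0.1, 6: 0.5}}
alpha, prob = forward([6, 6, 6, 1, 2], states, start_p, trans_p, emit_p)
print(prob)   # likelihood of the roll sequence under the model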
The Backward Algorithm
INITIALIZATION:  β_N(i) = 1
INDUCTION:       β_t(i) = Σ_j P(S_j | S_i) · P(x_{t+1} | S_j) · β_{t+1}(j)
TERMINATION:     P(x) = Σ_i P(s_1 = S_i) · P(x_1 | S_i) · β_1(i)
Time:  O(K²N)
Space: O(KN)
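The mirror-image sketch in Python, under the same assumptions as the forward sketch above:

def backward(observations, states, start_p, trans_p, emit_p):
    """beta[t][i] = P(x_{t+1}..x_N | state at time t = i); returns (beta, P(x))."""
    N = len(observations)
    beta = [{i: 1.0 for i in states} for _ in range(N)]        # initialization: beta_N(i) = 1
    for t in range(N - 2, -1, -1):                             # induction, right to left
        for i in states:
            beta[t][i] = sum(trans_p[i][j] * emit_p[j][observations[t + 1]] * beta[t + 1][j]
                             for j in states)
    prob = sum(start_p[i] * emit_p[i][observations[0]] * beta[0][i] for i in states)
    return beta, prob   # termination gives the same P(x) as the forward algorithm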
Three Questions
• Evaluation
– Forward algorithm
– (Could also go other direction)
• Decoding
– Viterbi algorithm
• Learning
– Baum-Welch Algorithm (aka “forward-backward”)
– A kind of EM (expectation maximization)
Need new slide!
Solution to #2 - Decoding
Given x=x1 …xN and HMM θ, what is “best” parse y1 …yN?
Several optimal solutions
• 1. States which are individually most likely
• 2. Single best state sequence
We want to find sequence y1 …yN,
such that P(x,y) is maximized
y* = argmax_y P(x, y)
Again, we can use dynamic programming!
[Trellis: states 1 … K at each position, over observations o1 o2 o3 … oK]
The Viterbi Algorithm
DEFINE:          δ_t(i) = max over state sequences ending in S_i at time t of P(x_1 … x_t, y_1 … y_{t-1}, y_t = S_i)
INITIALIZATION:  δ_1(i) = P(s_1 = S_i) · P(x_1 | S_i)
INDUCTION:       δ_{t+1}(j) = [ max_i δ_t(i) · P(S_j | S_i) ] · P(x_{t+1} | S_j)
TERMINATION:     P* = max_i δ_N(i)
Backtracking to get state sequence y*
The Viterbi Algorithm
[Trellis over x1 x2 ……xj-1 xj……xT with states 1, 2, …, i, …, K; cell δj(i)]
Time:  O(K²T)   (linear in length of sequence)
Space: O(KT)
Remember: δj(i) = probability of the most likely state sequence ending with state Si at time j
Slides from Serafim Batzoglou
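A compact Python sketch of Viterbi with backtracking, in the same dictionary style as the forward/backward sketches above:

def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (best state sequence y*, P(x, y*)) = argmax_y P(x, y)."""
    delta = [{i: start_p[i] * emit_p[i][observations[0]] for i in states}]   # initialization
    back = [{}]
    for t in range(1, len(observations)):                                    # induction
        delta.append({})
        back.append({})
        for j in states:
            best_i = max(states, key=lambda i: delta[t - 1][i] * trans_p[i][j])
            delta[t][j] = delta[t - 1][best_i] * trans_p[best_i][j] * emit_p[j][observations[t]]
            back[t][j] = best_i
    last = max(states, key=lambda i: delta[-1][i])                           # termination
    path = [last]
    for t in range(len(observations) - 1, 0, -1):                            # backtracking
        path.append(back[t][path[-1]])
    return list(reversed(path)), delta[-1][last]

# Usage: viterbi(rolls, states, start_p, trans_p, emit_p) with the casino parameters above.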
The Viterbi Algorithm
Pedro Domingos
Three Questions
• Evaluation
– Forward algorithm
– (Could also go other direction)
• Decoding
– Viterbi algorithm
• Learning
– Baum-Welch Algorithm (aka “forward-backward”)
– A kind of EM (expectation maximization)
Solution to #3 - Learning
Given x1 …xN, how do we learn θ = (start, transition, and emission probabilities) to maximize P(x)?
• Unfortunately, there is no known way to analytically find a
global maximum θ* such that
θ* = argmax_θ P(x | θ)
• But it is possible to find a local maximum; given an initial model
θ, we can always find a model θ’ such that
P(x | θ’) ≥ P(x | θ)
Chicken & Egg Problem
• If we knew the actual sequence of states
– It would be easy to learn transition and emission probabilities
– But we can’t observe states, so we don’t!
• If we knew transition & emission probabilities
– Then it’d be easy to estimate the sequence of states (Viterbi)
– But we don’t know them!
Slide by Daniel S. Weld
Simplest Version
• Mixture of two distributions
• Know: form of distribution & variance
• Just need mean of each distribution
Slide by Daniel S. Weld
Input Looks Like
[Histogram of unlabeled data points over values .01 to .09]
Slide by Daniel S. Weld
We Want to Predict
[Same data; for each point, predict which of the two distributions generated it]
Slide by Daniel S. Weld
Chicken & Egg
Note that coloring instances would be easy
if we knew the Gaussians….
Slide by Daniel S. Weld
Chicken & Egg
And finding the Gaussians would be easy
if we knew the coloring.
Slide by Daniel S. Weld
Expectation Maximization (EM)
• Pretend we do know the parameters
– Initialize randomly: set μ1 = ?;  μ2 = ?
Slide by Daniel S. Weld
Expectation Maximization (EM)
• Pretend we do know the parameters
– Initialize randomly
• [E step] Compute probability of instance having
each possible value of the hidden variable
Slide by Daniel S. Weld
Expectation Maximization (EM)
• Pretend we do know the parameters
– Initialize randomly
• [E step] Compute probability of instance having
each possible value of the hidden variable
[M step] Treating each instance as fractionally
having both values compute the new parameter
values
Slide by Daniel S. Weld
ML Mean of Single Gaussian
u_ML = argmin_u Σ_i (x_i – u)²
Slide by Daniel S. Weld
Expectation Maximization (EM)
• [E step] Compute probability of instance having each
possible value of the hidden variable
[M step] Treating each instance as fractionally
having both values compute the new parameter
values
Slide by Daniel S. Weld
Expectation Maximization (EM)
• [E step] Compute probability of instance having
each possible value of the hidden variable
Slide by Daniel S. Weld
Expectation Maximization (EM)
• [E step] Compute probability of instance having
each possible value of the hidden variable
[M step] Treating each instance as fractionally
having both values compute the new parameter
values
Slide by Daniel S. Weld
Expectation Maximization (EM)
• [E step] Compute probability of instance having
each possible value of the hidden variable
[M step] Treating each instance as fractionally
having both values compute the new parameter
values
Slide by Daniel S. Weld
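A toy Python sketch of these E and M steps for the two-Gaussian example, assuming a shared, known variance and estimating only the two means (the data values and sigma below are made up):

import math, random

def em_two_gaussians(xs, sigma=0.01, iters=25):
    """Alternate E and M steps to estimate the two means of a 2-Gaussian mixture."""
    mu = random.sample(xs, 2)                       # "pretend we know the parameters": random init
    for _ in range(iters):
        # E step: probability that each instance was generated by each Gaussian
        resp = []
        for x in xs:
            w = [math.exp(-(x - m) ** 2 / (2 * sigma ** 2)) for m in mu]
            z = sum(w)
            resp.append([wi / z for wi in w])
        # M step: treat each instance as fractionally belonging to both,
        # then recompute each mean from its fractional counts
        for k in range(2):
            total = sum(r[k] for r in resp)
            mu[k] = sum(r[k] * x for r, x in zip(resp, xs)) / total
    return mu

data = [0.01, 0.012, 0.03, 0.032, 0.07, 0.072, 0.09, 0.088]
print(em_two_gaussians(data))   # typically one low and one high mean (EM can hit local optima)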
EM for HMMs
• [E step] Compute probability of instance having
each possible value of the hidden variable
– Compute the forward and backward probabilities for given
model parameters and our observations
[M step] Treating each instance as fractionally
having both values compute the new parameter
values
- Re-estimate the model parameters
- Simple Counting
Summary - Learning
• Use hill-climbing
– Called the forward-backward (or Baum/Welch) algorithm
• Idea
– Use an initial parameter instantiation
– Loop
• Compute the forward and backward probabilities for given
model parameters and our observations
• Re-estimate the parameters
– Until estimates don’t change much
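The loop just described can be sketched in Python by reusing the forward and backward functions above; this simplified Baum-Welch (no scaling, no smoothing, no priors, fixed iteration count) only illustrates the structure:

def baum_welch(observations, states, start_p, trans_p, emit_p, iters=10):
    """Hill-climb P(x | theta): repeat forward-backward (E step) and re-estimation (M step)."""
    for _ in range(iters):
        alpha, prob = forward(observations, states, start_p, trans_p, emit_p)
        beta, _ = backward(observations, states, start_p, trans_p, emit_p)
        N = len(observations)
        # E step: expected state occupancies gamma_t(i) and expected transitions xi_t(i, j)
        gamma = [{i: alpha[t][i] * beta[t][i] / prob for i in states} for t in range(N)]
        xi = [{(i, j): alpha[t][i] * trans_p[i][j] * emit_p[j][observations[t + 1]]
                       * beta[t + 1][j] / prob
               for i in states for j in states} for t in range(N - 1)]
        # M step: re-estimate the parameters by (fractional) counting
        start_p = {i: gamma[0][i] for i in states}
        trans_p = {i: {j: sum(x[(i, j)] for x in xi) / sum(g[i] for g in gamma[:-1])
                       for j in states} for i in states}
        emit_p = {i: {o: sum(g[i] for g, x in zip(gamma, observations) if x == o)
                         / sum(g[i] for g in gamma)
                      for o in set(observations)} for i in states}
    return start_p, trans_p, emit_p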
IE Resources
• Data
– RISE, http://www.isi.edu/~muslea/RISE/index.html
– Linguistic Data Consortium (LDC)
• Penn Treebank, Named Entities, Relations, etc.
– http://www.biostat.wisc.edu/~craven/ie
– http://www.cs.umass.edu/~mccallum/data
• Code
– TextPro, http://www.ai.sri.com/~appelt/TextPro
– MALLET, http://www.cs.umass.edu/~mccallum/mallet
– SecondString, http://secondstring.sourceforge.net/
• Both
– http://www.cis.upenn.edu/~adwait/penntools.html
Slides from Cohen & McCallum
References
• [Bikel et al 1997] Bikel, D.; Miller, S.; Schwartz, R.; and Weischedel, R.: Nymble: a high-performance learning name-finder. In Proceedings of ANLP’97, p194-201.
• [Califf & Mooney 1999] Califf, M.E.; Mooney, R.: Relational Learning of Pattern-Match Rules for Information Extraction, in Proceedings of the Sixteenth National Conference on Artificial Intelligence (AAAI-99).
• [Cohen, Hurst, Jensen, 2002] Cohen, W.; Hurst, M.; Jensen, L.: A flexible learning system for wrapping tables and lists in HTML documents. Proceedings of The Eleventh International World Wide Web Conference (WWW-2002).
• [Cohen, Kautz, McAllester 2000] Cohen, W.; Kautz, H.; McAllester, D.: Hardening soft information sources. Proceedings of the Sixth International Conference on Knowledge Discovery and Data Mining (KDD-2000).
• [Cohen, 1998] Cohen, W.: Integration of Heterogeneous Databases Without Common Domains Using Queries Based on Textual Similarity, in Proceedings of ACM SIGMOD-98.
• [Cohen, 2000a] Cohen, W.: Data Integration using Similarity Joins and a Word-based Information Representation Language, ACM Transactions on Information Systems, 18(3).
• [Cohen, 2000b] Cohen, W.: Automatically Extracting Features for Concept Learning from the Web, Machine Learning: Proceedings of the Seventeenth International Conference (ML-2000).
• [Collins & Singer 1999] Collins, M.; and Singer, Y.: Unsupervised models for named entity classification. In Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, 1999.
• [De Jong 1982] De Jong, G.: An Overview of the FRUMP System. In: Lehnert, W. & Ringle, M. H. (eds), Strategies for Natural Language Processing. Lawrence Erlbaum, 1982, 149-176.
• [Freitag 98] Freitag, D.: Information extraction from HTML: application of a general machine learning approach, Proceedings of the Fifteenth National Conference on Artificial Intelligence (AAAI-98).
• [Freitag, 1999] Freitag, D.: Machine Learning for Information Extraction in Informal Domains. Ph.D. dissertation, Carnegie Mellon University.
• [Freitag 2000] Freitag, D.: Machine Learning for Information Extraction in Informal Domains, Machine Learning 39(2/3): 99-101 (2000).
• [Freitag & Kushmerick 2000] Freitag, D.; Kushmerick, N.: Boosted Wrapper Induction. Proceedings of the Seventeenth National Conference on Artificial Intelligence (AAAI-2000).
• [Freitag & McCallum 1999] Freitag, D. and McCallum, A.: Information extraction using HMMs and shrinkage. In Proceedings AAAI-99 Workshop on Machine Learning for Information Extraction. AAAI Technical Report WS-99-11.
• [Kushmerick, 2000] Kushmerick, N.: Wrapper Induction: efficiency and expressiveness, Artificial Intelligence, 118 (pp 15-68).
• [Lafferty, McCallum & Pereira 2001] Lafferty, J.; McCallum, A.; and Pereira, F.: Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data, In Proceedings of ICML-2001.
• [Leek 1997] Leek, T. R.: Information extraction using hidden Markov models. Master’s thesis. UC San Diego.
• [McCallum, Freitag & Pereira 2000] McCallum, A.; Freitag, D.; and Pereira, F.: Maximum entropy Markov models for information extraction and segmentation, In Proceedings of ICML-2000.
• [Miller et al 2000] Miller, S.; Fox, H.; Ramshaw, L.; Weischedel, R.: A Novel Use of Statistical Parsing to Extract Information from Text. Proceedings of the 1st Annual Meeting of the North American Chapter of the ACL (NAACL), p. 226-233.
Slides from Cohen & McCallum