Machine Learning for Information Extraction


15-505: Lecture 11
Generative Models for Text Classification
and Information Extraction
Kamal Nigam
Some slides from William Cohen, Andrew McCallum
Text Classification by Example
How could you build a text classifier?
• Take some ideas from machine learning
– Supervised learning setting
– Examples of each class (a few or thousands)
• Take some ideas from machine translation
– Generative models
– Language models
• Simplify each and stir thoroughly
Basic Approach of Generative Modeling
1. Pick representation for data
2. Write down probabilistic generative model
3. Estimate model parameters with training
data
4. Turn model around to calculate unknown
values for new data
Naïve Bayes: Bag of Words Representation
A document is represented by occurrence counts over all words in the dictionary. Example document:

    “Corn prices rose today while corn
     futures dropped in surprising
     trading activity. Corn …”

    word       count
    --------   -----
    activity     1
    cable        0
    corn         3
    damp         0
    drawer       0
    dropped      1
    elbow        0
    earning      0
    …            …
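A minimal sketch of this representation in Python (the eight-word dictionary is just the fragment from the slide; a real system would use the full vocabulary):

    import re
    from collections import Counter

    def bag_of_words(text, vocabulary):
        # Tokenize, then record how often each dictionary word occurs.
        tokens = re.findall(r"[a-z]+", text.lower())
        counts = Counter(tokens)
        return {word: counts[word] for word in vocabulary}

    vocab = ["activity", "cable", "corn", "damp", "drawer", "dropped", "elbow", "earning"]
    doc = ("Corn prices rose today while corn futures dropped "
           "in surprising trading activity. Corn ...")
    print(bag_of_words(doc, vocab))
    # {'activity': 1, 'cable': 0, 'corn': 3, 'damp': 0, 'drawer': 0,
    #  'dropped': 1, 'elbow': 0, 'earning': 0}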
Naïve Bayes: Mixture of Multinomials Model
1. Pick the class: P(class)
2. For every word, pick from the
class urn: P(word|class)
[Figure: two class urns, COMPUTERS and SPORTS. The COMPUTERS urn contains words such as java, modem, windows, web; the SPORTS urn contains words such as soccer, polo, ball, dropped, activity; both urns also contain function words such as the, in, while, again. A document is generated by picking an urn, then repeatedly drawing words from it.]
Word independence assumption!
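This generative story is short enough to write directly in code. A toy sketch, with made-up priors and word distributions standing in for the two urns:

    import random

    # Hypothetical parameters for illustration; a trained model
    # would estimate these from labeled documents.
    priors = {"COMPUTERS": 0.5, "SPORTS": 0.5}
    word_dists = {
        "COMPUTERS": {"java": 0.2, "modem": 0.1, "windows": 0.3, "web": 0.2, "the": 0.2},
        "SPORTS":    {"soccer": 0.3, "polo": 0.1, "ball": 0.2, "the": 0.3, "in": 0.1},
    }

    def generate_document(length):
        # 1. Pick the class: P(class)
        cls = random.choices(list(priors), weights=list(priors.values()))[0]
        # 2. Pick every word independently from the class urn: P(word|class)
        dist = word_dists[cls]
        words = random.choices(list(dist), weights=list(dist.values()), k=length)
        return cls, words

    print(generate_document(6))

Classification (step 4 of the recipe) turns this model around: given only the words, infer the class.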
Naïve Bayes: Estimating Parameters
• Just like estimating biased coin flip probabilities
• Estimate MAP word probabilities with add-one (Laplace) smoothing:

$$P(word \mid class) = \frac{1 + \sum_{doc \in class} N(word, doc)}{|Vocab| + \sum_{doc \in class} N(doc)}$$

where N(word, doc) is the number of occurrences of word in doc and N(doc) is the total number of words in doc.

• Estimate MAP class priors:

$$P(class) = \frac{1 + N(docs \in class)}{|Classes| + N(docs)}$$
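In code these estimates are just smoothed counts. A sketch, assuming the training documents arrive as (token list, class) pairs; the function name and signature are illustrative, not from the lecture:

    from collections import Counter

    def estimate_parameters(labeled_docs, vocabulary, classes):
        # labeled_docs: iterable of (token_list, class) pairs.
        # vocabulary: set of words kept in the model.
        word_counts = {c: Counter() for c in classes}
        doc_counts = Counter()
        for tokens, c in labeled_docs:
            word_counts[c].update(t for t in tokens if t in vocabulary)
            doc_counts[c] += 1

        # MAP word probabilities with add-one smoothing.
        p_word = {}
        for c in classes:
            total = sum(word_counts[c].values())
            p_word[c] = {w: (1 + word_counts[c][w]) / (len(vocabulary) + total)
                         for w in vocabulary}

        # MAP class priors, smoothed the same way.
        n_docs = sum(doc_counts.values())
        p_class = {c: (1 + doc_counts[c]) / (len(classes) + n_docs) for c in classes}
        return p_word, p_class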
Naïve Bayes: Performing Classification
$$P(class \mid doc) = \frac{P(doc \mid class)\, P(class)}{P(doc)} \propto P(class) \prod_{word \in doc} P(word \mid class)$$
• Word independence assumption
• Take the class with the highest probability
Classification Tricks of the Trade
• Stemming
– run, runs, running, ran → run
– table, tables, tabled → table
– computer, compute, computing → compute
• Stopwords
– Very frequent function words generally uninformative
– if, in, the, like, …
• Information gain feature selection
– Keep just most indicative words in the vocabulary
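For illustration, a toy preprocessing pass combining stopword removal with a deliberately crude suffix-stripping rule (a real system would use a proper stemmer such as Porter's):

    STOPWORDS = {"if", "in", "the", "like", "a", "an", "of", "to"}

    def crude_stem(word):
        # Strip a few common suffixes; this toy rule cannot handle
        # irregular forms such as "ran".
        for suffix in ("ning", "ing", "ed", "es", "s"):
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                return word[:-len(suffix)]
        return word

    def preprocess(tokens):
        return [crude_stem(t) for t in tokens if t not in STOPWORDS]

    print(preprocess(["the", "runs", "running", "in", "tables"]))
    # ['run', 'run', 'tabl']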
Naïve Bayes Rules of Thumb
• Need hundreds of labeled examples per class
for good performance (~85% accuracy)
• Stemming and stopwords may or may not help
• Feature selection may or may not help
• Predicted probabilities will be very extreme
• Use sum of logs instead of multiplying
probabilities for underflow prevention
• Coding this up is trivial, either as a MapReduce or not (a minimal sketch follows)
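A minimal sketch of the resulting classifier, using the sum-of-logs trick and the p_word / p_class tables from the estimation sketch above:

    import math

    def classify(tokens, p_word, p_class):
        # Take the class maximizing log P(class) + sum of log P(word|class);
        # summing logs avoids the underflow that multiplying
        # many small probabilities would cause.
        best, best_score = None, float("-inf")
        for c, prior in p_class.items():
            score = math.log(prior)
            for t in tokens:
                if t in p_word[c]:        # skip out-of-vocabulary words
                    score += math.log(p_word[c][t])
            if score > best_score:
                best, best_score = c, score
        return best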
Information Extraction with
Generative Models
Example: A Problem
Mt. Baker, the school district
Baker Hostetler, the company
Baker, a job opening
Genomics job
Example: A Solution
Job Openings:
Category = Food Services
Keyword = Baker
Location = Continental U.S.
Extracting Job Openings from the Web
Title: Ice Cream Guru
Description: If you dream of cold creamy…
Contact: [email protected]
Category: Travel/Hospitality
Function: Food Services
Potential Enabler of Faceted Search
Lots of Structured Information in Text
IE from Research Papers
What is Information Extraction?
• Recovering structured data from formatted text
– Identifying fields (e.g. named entity recognition)
– Understanding relations between fields (e.g. record
association)
– Normalization and deduplication
• Today, focus on field identification
IE Posed as a Machine Learning Task
• Training data: documents marked up with
ground truth
• In contrast to text classification, local features are crucial. Features of:
– Contents
– Text just before item
– Text just after item
– Begin/end boundaries

Example (the target field is the seminar location):

    … 00 : pm Place :   Wean Hall Rm 5409   Speaker : Sebastian Thrun …
        [ prefix ]      [   contents    ]   [       suffix        ]
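A sketch of such features for one candidate span, assuming tokenized input; the feature names and the context width of two tokens are arbitrary illustrative choices:

    def window_features(tokens, start, end, context=2):
        # Features drawn from the candidate's contents, the text just
        # before it (prefix), and the text just after it (suffix).
        prefix = tokens[max(0, start - context):start]
        contents = tokens[start:end]
        suffix = tokens[end:end + context]
        feats = {f"prefix={w}" for w in prefix}
        feats |= {f"contents={w}" for w in contents}
        feats |= {f"suffix={w}" for w in suffix}
        feats.add(f"length={end - start}")
        return feats

    tokens = "00 : pm Place : Wean Hall Rm 5409 Speaker : Sebastian Thrun".split()
    print(window_features(tokens, 5, 9))   # candidate: "Wean Hall Rm 5409"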
Good Features for Information Extraction
Creativity and Domain Knowledge Required!
Example word features:
– identity of word
– is in all caps
– ends in “-ski”
– is part of a noun phrase
– is in a list of city names
– is under node X in WordNet or Cyc
– is in bold font
– is in hyperlink anchor
– features of past & future
– last person name was female
– next two words are “and Associates”

More example features:
– contains-question-mark, contains-question-word, ends-with-question-mark
– begins-with-number, begins-with-ordinal, begins-with-punctuation, begins-with-question-word, begins-with-subject
– first-alpha-is-capitalized, contains-alphanum, contains-non-space, contains-number, contains-bracketed-number, contains-http, contains-pipe
– indented, indented-1-to-4, indented-5-to-10
– blank, only-punctuation, more-than-one-third-space, shorter-than-30
– prev-is-blank, prev-begins-with-ordinal
Good Features for Information Extraction
Creativity and Domain Knowledge Required!
– Is Capitalized / Is Mixed Caps / Is All Caps / Initial Cap / All lowercase
– Contains Digit / Is Initial
– Punctuation: Period, Comma, Apostrophe, Dash
– Preceded by HTML tag
– Character n-gram classifier says string is a person name (80% accurate)
– In stopword list (the, of, their, etc.)
– In honorific list (Mr, Mrs, Dr, Sen, etc.)
– In person suffix list (Jr, Sr, PhD, etc.)
– In name particle list (de, la, van, der, etc.)
– In Census lastname list; segmented by P(name)
– In Census firstname list; segmented by P(name)
– In locations lists (states, cities, countries)
– In company name list (“J. C. Penny”)
– In list of company suffixes (Inc, & Associates, Foundation)

Word features:
– lists of job titles
– lists of prefixes
– lists of suffixes
– 350 informative phrases

HTML/formatting features:
– {begin, end, in} × {<b>, <i>, <a>, <hN>} × {lengths 1, 2, 3, 4, or longer}
– {begin, end} of line
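Most of these tests are one-liners in code. A sketch of a few as token predicates (an illustrative subset; the list-based features would be lookups into the gazetteers above):

    import re

    FEATURES = {
        "is_capitalized": lambda w: w[:1].isupper(),
        "is_all_caps":    lambda w: w.isupper(),
        "is_all_lower":   lambda w: w.islower(),
        "contains_digit": lambda w: any(ch.isdigit() for ch in w),
        "ends_with_ski":  lambda w: w.lower().endswith("ski"),
        "is_initial":     lambda w: re.fullmatch(r"[A-Z]\.", w) is not None,
        "in_honorifics":  lambda w: w.rstrip(".") in {"Mr", "Mrs", "Dr", "Sen"},
    }

    def token_features(word):
        # Return the names of all tests the token passes.
        return {name for name, test in FEATURES.items() if test(word)}

    print(token_features("Dr."))   # {'is_capitalized', 'in_honorifics'}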
Landscape of ML Techniques for IE:
Classify Candidates
Classify Candidates:
Abraham Lincoln was born in Kentucky.
A classifier asks of each candidate phrase: which class?

Sliding Window:
Abraham Lincoln was born in Kentucky.
A classifier asks of each window: which class? Try alternate window sizes.

Boundary Models:
Abraham Lincoln was born in Kentucky.
Classifiers mark BEGIN and END boundaries.

Finite State Machines:
Abraham Lincoln was born in Kentucky.
Most likely state sequence?

Wrapper Induction:
<b><i>Abraham Lincoln</i></b> was born in Kentucky.
Learn and apply a pattern for a website, e.g. <b><i> → PersonName.

Any of these models can be used to capture words, formatting or both.
Sliding Windows & Boundary Detection
Information Extraction by Sliding Windows

E.g., looking for the seminar location:

GRAND CHALLENGES FOR MACHINE LEARNING

Jaime Carbonell
School of Computer Science
Carnegie Mellon University

3:30 pm
7500 Wean Hall

Machine learning has evolved from obscurity
in the 1970s into a vibrant and popular
discipline in artificial intelligence
during the 1980s and 1990s. As a result
of its success and growth, machine learning
is evolving into a collection of related
disciplines: inductive concept acquisition,
analytic learning in problem solving (e.g.
analogy, explanation-based learning),
learning theory (e.g. PAC learning),
genetic algorithms, connectionist learning,
hybrid systems, and so on.

CMU UseNet Seminar Announcement

A fixed window slides across the document, and each position is classified as containing the target field or not.
Information Extraction with Sliding Windows
[Freitag 97, 98; Soderland 97; Califf 98]
    … 00 : pm Place :    Wean Hall Rm 5409    Speaker : Sebastian Thrun …
      w_{t-m} … w_{t-1}  w_t … w_{t+n}        w_{t+n+1} … w_{t+n+m}
      [ prefix ]         [ contents ]         [ suffix ]
• Standard supervised learning setting
– Positive instances: Windows with real label
– Negative instances: All other windows
– Features based on candidate, prefix and suffix
• Special-purpose rule learning systems work well, e.g.:

    courseNumber(X) :-
        tokenLength(X, =, 2),
        every(X, inTitle, false),
        some(X, A, <previousToken>, inTitle, true),
        some(X, B, <>, tripleton, true)
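A sketch of how training instances arise in this setting: enumerate every window up to some maximum length, label the window matching the ground truth positive and all others negative. Features would then be computed per window as in the window_features sketch earlier; the max_len cap is an assumed implementation detail:

    def candidate_windows(tokens, max_len=5):
        # Every span of up to max_len tokens is a candidate field.
        for start in range(len(tokens)):
            for end in range(start + 1, min(start + max_len, len(tokens)) + 1):
                yield start, end

    def training_instances(tokens, true_span, max_len=5):
        # Positive instance: the window matching the labeled field.
        # Negative instances: every other window.
        for span in candidate_windows(tokens, max_len):
            yield span, span == true_span

    tokens = "00 : pm Place : Wean Hall Rm 5409 Speaker : Sebastian Thrun".split()
    data = list(training_instances(tokens, true_span=(5, 9)))  # "Wean Hall Rm 5409"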
IE by Boundary Detection

E.g., looking for the seminar location:

GRAND CHALLENGES FOR MACHINE LEARNING

Jaime Carbonell
School of Computer Science
Carnegie Mellon University

3:30 pm
7500 Wean Hall

Machine learning has evolved from obscurity
in the 1970s into a vibrant and popular
discipline in artificial intelligence
during the 1980s and 1990s. As a result
of its success and growth, machine learning
is evolving into a collection of related
disciplines: inductive concept acquisition,
analytic learning in problem solving (e.g.
analogy, explanation-based learning),
learning theory (e.g. PAC learning),
genetic algorithms, connectionist learning,
hybrid systems, and so on.

CMU UseNet Seminar Announcement

Classifiers mark candidate begin and end boundaries, which are then paired to form extracted fields.
BWI: Learning to detect boundaries
[Freitag & Kushmerick, AAAI 2000]
• Another formulation: learn three probabilistic
classifiers:
– START(i) = Prob(position i starts a field)
– END(j) = Prob(position j ends a field)
– LEN(k) = Prob(an extracted field has length k)
• Then score a possible extraction (i, j) by START(i) * END(j) * LEN(j - i)
• LEN(k) is estimated from a histogram
• START(i) and END(j) learned by boosting over simple
boundary patterns and features
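A sketch of the scoring step, assuming the per-position boundary probabilities and the length histogram have already been estimated (names and data layout are illustrative):

    def best_extraction(start_probs, end_probs, len_probs):
        # start_probs[i], end_probs[j]: boundary classifier outputs per
        # token position; len_probs: dict built from a histogram of
        # field lengths seen in training.
        best, best_score = None, 0.0
        for i, s in enumerate(start_probs):
            for j, e in enumerate(end_probs):
                if j <= i:
                    continue
                # Score extraction (i, j) by START(i) * END(j) * LEN(j - i).
                score = s * e * len_probs.get(j - i, 0.0)
                if score > best_score:
                    best, best_score = (i, j), score
        return best, best_score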
Problems with Sliding Windows
and Boundary Finders
• Decisions in neighboring parts of the input are made
independently from each other.
– Sliding Window may predict a “seminar end time” before the
“seminar start time”.
– It is possible for two overlapping windows to both be above
threshold.
– In a Boundary-Finding system, left boundaries are laid down
independently from right boundaries, and their pairing
happens as a separate step.
Hidden Markov Models
Citation Parsing
• Fahlman, Scott & Lebiere, Christian (1989). The cascade-correlation learning architecture. Advances in Neural Information Processing Systems, pp. 524-532.

• Fahlman, S.E. and Lebiere, C., “The Cascade Correlation Learning Architecture,” Neural Information Processing Systems, pp. 524-532, 1990.

• Fahlman, S. E. (1991) The recurrent cascade-correlation learning architecture. NIPS 3, 190-205.
Can we do this with probabilistic
generative models?
• Could have classes for {author, title, journal,
year, pages}
• Could classify every word or sequence?
– Which sequences?
• Something interesting in the sequence of
fields that we’d like to capture
– Authors come first
– Title comes before journal
– Page numbers come near the end
Hidden Markov Models: The Representation
• A document is a sequence of words
• Each word is tagged by its class
• fahlman s e and lebiere c the cascade correlation learning
architecture neural information processing systems pp 524 532
1990
HMM: Generative Model (1)
[State-transition diagram with states: Author, Title, Journal, Year, Pages]
HMM: Generative Model (2)
[A second state-transition diagram with states: Author, Title, Year, Pages]
HMM: Generative Model (3)
• States: x_i
• State transitions: P(x_i | x_j) = a[x_i | x_j]
• Output probabilities: P(o_i | x_j) = b[o_i | x_j]
• Markov independence assumption
HMMs: Estimating Parameters
• With fully-labeled data, just like naïve Bayes
• Estimate MAP output probabilities with add-one smoothing:

$$b[o_i \mid x_j] = \frac{1 + \sum_{word \in data@x_j} N(o_i, word)}{|Vocab| + \sum_{word \in data@x_j} 1}$$

• Estimate MAP state transitions:

$$a[x_i \mid x_j] = \frac{1 + \sum_{x_j \in data} N(x_j, x_i)}{|States| + \sum_{x_j \in data} 1}$$

where N(x_j, x_i) counts the transitions from state x_j to state x_i in the labeled data.
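With fully labeled sequences both tables are simple smoothed counts, mirroring the naïve Bayes estimates. A sketch (names and data layout are illustrative):

    from collections import Counter

    def estimate_hmm(tagged_sequences, vocabulary, states):
        # tagged_sequences: list of [(word, state), ...] sequences.
        emit = {x: Counter() for x in states}
        trans = {x: Counter() for x in states}
        for seq in tagged_sequences:
            for word, state in seq:
                emit[state][word] += 1
            for (_, s1), (_, s2) in zip(seq, seq[1:]):
                trans[s1][s2] += 1

        V, S = len(vocabulary), len(states)
        # Add-one smoothing keeps every probability strictly positive.
        b = {x: {o: (1 + emit[x][o]) / (V + sum(emit[x].values()))
                 for o in vocabulary} for x in states}
        a = {x: {y: (1 + trans[x][y]) / (S + sum(trans[x].values()))
                 for y in states} for x in states}
        return a, b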
HMMs: Performing Extraction
• Given output words:
– fahlman s e 1991 the recurrent cascade correlation learning
architecture nips 3 190 205
• Find the state sequence that maximizes:

$$\prod_i a[x_i \mid x_{i-1}] \, b[o_i \mid x_i]$$

• Lots of possible state sequences to test: 5^14 for these 14 words and 5 classes
Hmm…
Representation for Paths: Trellis

[Trellis diagram: one column of candidate states per output word; each left-to-right path through the trellis is one possible state sequence]
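The trellis is what makes decoding tractable: instead of enumerating all 5^14 sequences, the Viterbi algorithm fills in one trellis column per output word, keeping for each state only the best path into it, at a cost of O(T * |states|^2) rather than |states|^T. A sketch, assuming an initial state distribution priors and smoothed a, b tables (e.g. from the estimate_hmm sketch above) so that every log is finite:

    import math

    def viterbi(observations, states, a, b, priors):
        # delta[t][x]: log-probability of the best path ending in state x
        # after emitting the first t+1 observations (assumed non-empty).
        delta = [{x: math.log(priors[x] * b[x][observations[0]]) for x in states}]
        back = []
        for o in observations[1:]:
            col, ptr = {}, {}
            for x in states:
                # Best predecessor state for x in this column.
                prev = max(states, key=lambda y: delta[-1][y] + math.log(a[y][x]))
                col[x] = delta[-1][prev] + math.log(a[prev][x]) + math.log(b[x][o])
                ptr[x] = prev
            delta.append(col)
            back.append(ptr)
        # Walk the back-pointers to recover the best state sequence.
        last = max(states, key=lambda x: delta[-1][x])
        path = [last]
        for ptr in reversed(back):
            path.append(ptr[path[-1]])
        return list(reversed(path))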
HMM Example: Nymble
[Bikel, et al 97]
Task: Named Entity Extraction
• Bigram within classes
• Backoff to unigram
• Special capitalization
and number features…
[State diagram: start-of-sentence and end-of-sentence states surrounding the name-class states Person, Org, (five other name classes), and Other]
Train on 450k words of news wire text.
Results:

    Case    Language    F1
    -----   --------    ---
    Mixed   English     93%
    Upper   English     91%
    Mixed   Spanish     90%
[Figure: Nymble word features]
HMMs: A Plethora of Applications
• Information extraction
• Part of speech tagging
• Word segmentation
• Gene finding
• Protein structure prediction
• Speech recognition
• Economics, Climatology, Robotics, …