Bootstrapping for Learning Tasks
Bootstrapping for Text Learning Tasks
Ramya Nagarajan
AIML Seminar
March 6, 2001
Preamble
Bootstrapping for Text Learning Tasks.
(1999) Jones, R., McCallum, A., Nigam,
K., and Riloff, E.
From the IJCAI-99 Workshop on Text
Mining: Foundations, Techniques and
Applications
March 27: Ellen Riloff
– http://www.cs.utah.edu/~riloff
Introduction
Learning algorithms require lots of
labeled training data
– time-consuming & tedious!
Bootstrapping = small quantity of
labeled data (seed) + large quantity of
unlabeled data
– can be used for text learning tasks that
otherwise require large training sets
unlabeled data obtained automatically
Case Studies - 1
learning extraction patterns and dictionaries
for information extraction
– Supplied knowledge = keywords & parser
noun phrase classifier & NP context classifier
(based on extraction patterns)
– given noun phrases as seed
generate dictionaries for locations from
corporate web pages
– 76% accuracy after 50 iterations
Case Studies - 2
document classification using a naïve
Bayes classifier
– provide keywords for each class & class
hierarchy
classification of computer science
papers
– 66% accuracy (compared to human
agreement levels of 72%)
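The keyword-seeded setup above can be sketched roughly as follows; the classes, keywords, and documents here are toy stand-ins, not the paper's computer-science hierarchy:

```python
import math
from collections import Counter, defaultdict

# Toy classes and per-class keywords (hypothetical, not the paper's data).
KEYWORDS = {
    "os": {"kernel", "scheduler", "filesystem"},
    "ml": {"classifier", "training", "bayes"},
}

def tokenize(text):
    return text.lower().split()

def seed_label(doc):
    """Assign the class whose keywords the document mentions most (None if no hits)."""
    hits = {c: sum(w in kws for w in tokenize(doc)) for c, kws in KEYWORDS.items()}
    best = max(hits, key=hits.get)
    return best if hits[best] > 0 else None

def train_nb(labeled_docs):
    """Train a multinomial naive Bayes model from (doc, class) pairs."""
    word_counts, class_counts, vocab = defaultdict(Counter), Counter(), set()
    for doc, c in labeled_docs:
        class_counts[c] += 1
        for w in tokenize(doc):
            word_counts[c][w] += 1
            vocab.add(w)
    return word_counts, class_counts, vocab

def classify(doc, model):
    """Pick the class maximizing log P(c) + sum log P(w|c), add-one smoothed."""
    word_counts, class_counts, vocab = model
    total = sum(class_counts.values())
    def log_prob(c):
        n_c = sum(word_counts[c].values())
        return math.log(class_counts[c] / total) + sum(
            math.log((word_counts[c][w] + 1) / (n_c + len(vocab)))
            for w in tokenize(doc))
    return max(class_counts, key=log_prob)

docs = [
    "the kernel scheduler handles processes",
    "training a bayes classifier on labeled data",
    "filesystem performance in the kernel",
    "classifier training needs data",
]
# Keyword matching supplies the "labels"; no hand-labeled corpus is needed.
labeled = [(d, seed_label(d)) for d in docs if seed_label(d) is not None]
model = train_nb(labeled)
print(classify("scheduler and filesystem tuning", model))  # → os
```

The point of the sketch is the division of labor: the keywords only have to label enough documents to bootstrap the classifier, which then generalizes to documents containing none of the keywords.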
Information Extraction
IE = identifying predefined types of
information from text
extraction patterns + semantic lexicon
(words/phrases with semantic category
labels)
Example extraction pattern:
Name: %Murdered%
Event Type: MURDER
Trigger Word: murdered
Slots:
– VICTIM: <subject> (human)
– PERPETRATOR: <prep-phrase, by> (human)
Information Extraction
previous extraction systems require
– training corpus with annotations for desired
extractions
– manually defined keywords, frames or
object recognizers
Bootstrapping technique uses texts from
the domain & small set of seed words
Information Extraction
based on two observations:
– if “schnauzer”, “terrier”, “dalmatian” refer to
dogs → discover pattern “<X> barked”
– if we know “<X> barked” is a good pattern for
extracting dogs → every NP it extracts
refers to a dog
mutual bootstrapping = seed words of
semantic category → learned extraction
patterns → new category members
Mutual Bootstrapping
Generate all candidate extraction patterns
from the training corpus using AutoSlog (a
tool that builds dictionaries of extraction
patterns)
Apply candidate extraction patterns to training
corpus & save the patterns with their
extractions
Next stage: label semantic categories of
extraction patterns & NPs
Mutual Bootstrapping Overview
[diagram] inner loop: select best extraction
pattern (EP) → add best EP’s extractions to the
temporary semantic lexicon & extraction phrase list
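A rough sketch of that loop, with toy patterns and extractions echoing the paper's location example (for brevity, patterns are scored here by raw overlap with the current lexicon rather than the RlogF metric the next slide describes):

```python
# Toy candidate patterns mapped to the NPs they extract from a corpus
# (hypothetical data, in the spirit of the paper's location example).
PATTERNS = {
    "kidnapped in <X>": {"bolivia", "colombia", "guatemala", "january"},
    "traveled to <X>":  {"bolivia", "guatemala", "honduras"},
    "shot in <X>":      {"the city", "the leg", "the arm"},
}

def mutual_bootstrap(patterns, seed_words, iterations=5):
    """Grow a semantic lexicon from seed words by repeatedly taking the
    best-scoring unused pattern and adding ALL of its extractions."""
    lexicon, used = set(seed_words), []
    for _ in range(iterations):
        score = lambda p: len(patterns[p] & lexicon)  # overlap with lexicon
        candidates = [p for p in patterns if p not in used]
        if not candidates:
            break
        best = max(candidates, key=score)
        if score(best) == 0:          # no pattern touches the lexicon: stop
            break
        used.append(best)
        lexicon |= patterns[best]     # best EP's extractions join the lexicon
    return lexicon, used

lexicon, patterns_learned = mutual_bootstrap(PATTERNS, {"bolivia", "colombia"})
print(patterns_learned)  # "kidnapped in <X>" first, then "traveled to <X>"
```

Note that "january" enters the lexicon alongside the real locations: because every extraction of the best pattern is accepted, one bad extraction propagates, which is exactly the problem the later slides address.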
Mutual Bootstrapping (cont.)
Score extraction patterns: more general
patterns are scored higher; use head
phrase matching
Scoring uses the RlogF metric:
score(pattern_i) = R_i * log2(F_i)
– F_i = number of the pattern’s extractions that
are known category members; R_i = F_i divided
by the pattern’s total extractions
identifies the most reliable extraction patterns &
patterns that frequently extract relevant info
(irrelevant info may also be extracted)
e.g. “kidnapped in <location>” vs. “kidnapped
in January”
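Plugging in numbers makes the metric concrete (the counts below are made up): a pattern must be both reliable (high R) and productive (high F) to score well.

```python
import math

def rlogf(n_members, n_total):
    """RlogF score: R * log2(F), where F = extractions that are known
    category members and R = F / total extractions of the pattern."""
    if n_members == 0:
        return 0.0
    return (n_members / n_total) * math.log2(n_members)

# A pattern extracting 8 known locations out of 10 NPs beats one that is
# 100% reliable but fired only twice: generality is rewarded.
print(round(rlogf(8, 10), 3))  # 0.8 * log2(8) = 2.4
print(round(rlogf(2, 2), 3))   # 1.0 * log2(2) = 1.0
```

This is why the slide's "kidnapped in <location>" wins despite occasionally extracting noise like "January": its many correct extractions push F (and thus log2 F) up faster than the noise pulls R down.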
Problems…
“shot in <X>”: location or body part?
– extracting many body parts for the location
category → low accuracy
Solution: save the 5 most reliable NPs from the
bootstrapping process
• restart the inner bootstrapping process
reliable NP = one extracted by many
extraction patterns
Meta-Bootstrapping
[diagram] seed words initialize mutual
bootstrapping over the candidate extraction
patterns & their extractions; after each inner run,
the 5 best NPs are added to the permanent
semantic lexicon, which re-initializes mutual
bootstrapping (inner loop: select best EP →
add best EP’s extractions to the temporary
semantic lexicon & extraction phrase list)
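The outer loop's "add 5 best NPs" step can be sketched as follows, with reliability measured simply as how many learned patterns extracted the NP (the patterns and NPs here are toy data, not the paper's):

```python
from collections import Counter

# Hypothetical output of one inner mutual-bootstrapping run:
# each learned pattern with the NPs it extracted.
run_output = {
    "kidnapped in <X>": {"bolivia", "colombia", "the city", "january"},
    "operates in <X>":  {"bolivia", "colombia", "guatemala"},
    "taken in <X>":     {"colombia", "the city", "january"},
}

def most_reliable_nps(run_output, k=5):
    """Rank NPs by how many distinct patterns extracted them; keep the top k.
    These enter the permanent lexicon before the inner loop restarts."""
    counts = Counter(np for nps in run_output.values() for np in nps)
    return [np for np, _ in counts.most_common(k)]

permanent_lexicon = set()
permanent_lexicon |= set(most_reliable_nps(run_output, k=2))
print(most_reliable_nps(run_output, k=2)[0])  # → colombia
```

Keeping only the most widely-corroborated NPs filters out most one-off errors, though a noisy NP extracted by many patterns can still slip through; meta-bootstrapping reduces drift rather than eliminating it.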
Results
Seed words (terrorist locations): bolivia, city,
colombia, …
Location patterns extracted by meta-bootstrapping
after 50 iterations:
– Kidnapped in <X>
– Taken in <X>
– Operates in <X>
– Billion in <X>
76% of hypothesized location phrases were
true locations
Related Work
DIPRE algorithm of Brin (1998) uses
bootstrapping to extract (title, author)
pairs for books on WWW.
Yarowsky (1995) used a bootstrapping
algorithm for a word sense
disambiguation task
Nigam (1999) used a few labeled
documents instead of keywords
References
Bootstrapping for Text Learning Tasks.
(1999) Jones, R., McCallum, A., Nigam,
K., and Riloff, E.
Learning Dictionaries for Information
Extraction by Multi-Level Bootstrapping.
(1999) Riloff, E. and Jones, R.
Foundations of Statistical Natural
Language Processing. Manning and
Schütze.