
Bootstrapping for Text Learning Tasks
Ramya Nagarajan
AIML Seminar
March 6, 2001
Preamble

• Bootstrapping for Text Learning Tasks. (1999) Jones, R., McCallum, A., Nigam, K., and Riloff, E.
• From the IJCAI-99 Workshop on Text Mining: Foundations, Techniques and Applications
• March 27: Ellen Riloff
  – http://www.cs.utah.edu/~riloff
Introduction

• Learning algorithms require lots of labeled training data
  – time-consuming & tedious!
• Bootstrapping = small quantity of labeled data (seed) + large quantity of unlabeled data
  – can be used for text learning tasks that otherwise require large training sets
• unlabeled data obtained automatically
Case Studies - 1

• learning extraction patterns and dictionaries for information extraction
  – supplied knowledge = keywords & parser
• noun phrase classifier & NP context classifier (based on extraction patterns)
  – given noun phrases as seed
• generate dictionaries for locations from corporate web pages
  – 76% accuracy after 50 iterations
Case Studies - 2

• document classification using a naïve Bayes classifier
  – provide keywords for each class & class hierarchy
• classification of computer science papers
  – 66% accuracy (compare to human agreement levels of 72%)
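A minimal sketch of the keyword-seeded classification idea above: keywords pseudo-label the unlabeled documents, and a naïve Bayes classifier is trained on those pseudo-labels. The documents, class names, and keywords here are illustrative assumptions, not the paper's data, and the sketch omits the class hierarchy and the EM refinement the actual system uses.

```python
from collections import Counter, defaultdict
import math

# Toy corpus and per-class keyword seeds (assumptions, not the paper's data).
docs = [
    "neural network training gradient",
    "parser grammar syntax tree",
    "gradient descent network weights",
    "syntax parsing grammar rules",
]
keywords = {"ml": {"network", "gradient"}, "nlp": {"parser", "grammar", "syntax"}}

# Step 1: pseudo-label each document by keyword matches (the seed supervision).
def keyword_label(doc):
    counts = {c: sum(w in kw for w in doc.split()) for c, kw in keywords.items()}
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else None

labeled = [(d, keyword_label(d)) for d in docs]

# Step 2: train a naive Bayes classifier on the pseudo-labeled documents.
class_word_counts = defaultdict(Counter)
class_doc_counts = Counter()
vocab = set()
for doc, c in labeled:
    if c is None:
        continue
    class_doc_counts[c] += 1
    for w in doc.split():
        class_word_counts[c][w] += 1
        vocab.add(w)

def classify(doc):
    scores = {}
    for c in class_doc_counts:
        total = sum(class_word_counts[c].values())
        # log prior + log likelihood with Laplace smoothing over the vocabulary
        score = math.log(class_doc_counts[c] / sum(class_doc_counts.values()))
        for w in doc.split():
            score += math.log((class_word_counts[c][w] + 1) / (total + len(vocab)))
        scores[c] = score
    return max(scores, key=scores.get)

print(classify("network gradient update"))  # -> "ml"
```

In the full system, the classifier's own confident predictions replace the keyword labels on later rounds, which is where the bootstrapping gain comes from.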
Information Extraction

• IE = identifying predefined types of information from text
• extraction patterns + semantic lexicon (words/phrases with semantic category labels)

Example extraction pattern:

  Name:         %Murdered%
  Event Type:   MURDER
  Trigger Word: murdered
  Slots:        VICTIM: <subject> (human)
                PERPETRATOR: <prep-phrase, by> (human)
Information Extraction (cont.)

• previous extraction systems require
  – training corpus with annotations for desired extractions
  – manually defined keywords, frames, or object recognizers
• bootstrapping technique uses texts from the domain & a small set of seed words
Information Extraction (cont.)

• based on two observations:
  – if "schnauzer", "terrier", "dalmatian" refer to dogs → discover pattern "<X> barked"
  – if we know "<X> barked" is a good pattern for extracting dogs → every NP it extracts refers to a dog
• mutual bootstrapping = seed words of semantic category → learned extraction patterns → new category members
Mutual Bootstrapping

• Generate all candidate extraction patterns from the training corpus using AutoSlog (a tool that builds dictionaries of extraction patterns)
• Apply candidate extraction patterns to the training corpus & save the patterns with their extractions
• Next stage: label semantic categories of extraction patterns & NPs
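The loop described above can be sketched as follows. The candidate patterns and their extractions are assumed to be already generated (e.g. by AutoSlog); the pattern names and NP sets here are illustrative assumptions. The paper scores patterns with the RlogF metric, which a simple overlap count with the current lexicon stands in for here.

```python
# Illustrative candidate patterns -> NPs they extract (assumed, not real data).
pattern_extractions = {
    "kidnapped in <X>": {"bolivia", "bogota", "the city", "january"},
    "operates in <X>": {"bolivia", "colombia", "peru"},
    "traveled to <X>": {"bogota", "peru", "the north"},
}

def mutual_bootstrap(seed, iterations=2):
    lexicon = set(seed)   # temporary semantic lexicon, grown each round
    chosen = []           # learned extraction patterns, in selection order
    for _ in range(iterations):
        candidates = [p for p in pattern_extractions if p not in chosen]
        if not candidates:
            break
        # Pick the pattern whose extractions overlap the lexicon most
        # (a stand-in for the paper's RlogF scoring).
        best = max(candidates, key=lambda p: len(pattern_extractions[p] & lexicon))
        chosen.append(best)
        lexicon |= pattern_extractions[best]  # add best EP's extractions
    return lexicon, chosen

lexicon, learned_patterns = mutual_bootstrap({"bolivia", "colombia"})
```

Note that once a noisy pattern is selected, non-locations such as "january" enter the lexicon too, illustrating why irrelevant extractions are a concern for this loop.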
Mutual Bootstrapping Overview

[Diagram: mutual bootstrapping loop — select the best extraction pattern (EP), add the best EP's extractions to the temporary semantic lexicon / extraction phrase list, and repeat.]
Mutual Bootstrapping (cont.)

• Score extraction patterns → more general patterns are scored higher; use head phrase matching
• Scoring also uses the RlogF metric:
  – score(pattern_i) = R_i * log2(F_i)
• identifies the most reliable extraction patterns & patterns that frequently extract relevant info (irrelevant info may also be extracted)
  – e.g. "kidnapped in <location>" vs. "kidnapped in January"
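A small numeric illustration of the RlogF metric above, with hypothetical counts (assumptions, not figures from the paper): F_i is the number of the pattern's extractions already in the category lexicon, and R_i = F_i divided by the total number of NPs the pattern extracts.

```python
import math

def rlogf(f, n):
    """RlogF score: R * log2(F), where f = extractions found in the
    category lexicon and n = total NPs extracted by the pattern."""
    return (f / n) * math.log2(f)

# A pattern extracting 8 known locations out of 10 NPs outscores one
# extracting 2 known locations out of 2 NPs: RlogF rewards frequency
# as well as reliability. (Counts are hypothetical.)
print(rlogf(8, 10))  # 0.8 * log2(8) = 2.4
print(rlogf(2, 2))   # 1.0 * log2(2) = 1.0
```

This is why "kidnapped in <location>" style patterns, which fire often on true locations, win out even though some of their extractions (e.g. "January") are wrong.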
Problems…

• "shot in <X>": location or body part?
  – patterns extracting many body parts get learned for the location category → low accuracy
• save only the 5 most reliable NPs from the bootstrapping process
  • restart the inner bootstrapping process again
• reliable NP = one extracted by many extraction patterns
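The fix above is the meta-bootstrapping outer loop, sketched below: after each inner run, only the 5 most reliable NPs (those extracted by the most patterns) enter the permanent lexicon before the inner process restarts. The data and the stand-in inner routine are illustrative assumptions.

```python
# Illustrative candidate patterns -> NPs they extract (assumed data).
patterns = {
    "kidnapped in <X>": {"bolivia", "bogota", "january"},
    "operates in <X>": {"bolivia", "bogota", "peru"},
}

def reliability(np_phrase):
    """An NP's reliability = number of distinct patterns that extract it."""
    return sum(np_phrase in exts for exts in patterns.values())

def inner_bootstrap(lexicon):
    # Stand-in for a full mutual bootstrapping run: union of all extractions.
    result = set(lexicon)
    for exts in patterns.values():
        result |= exts
    return result

def meta_bootstrap(seed, outer_iters=2):
    permanent = set(seed)  # permanent semantic lexicon
    for _ in range(outer_iters):
        temp = inner_bootstrap(permanent)  # temporary lexicon, then discarded
        candidates = temp - permanent
        # keep only the 5 NPs extracted by the most patterns; one-off
        # extractions (like "january" here) rank lowest
        best5 = sorted(candidates, key=reliability, reverse=True)[:5]
        permanent |= set(best5)
    return permanent

permanent_lexicon = meta_bootstrap({"bolivia"})
```

Discarding the temporary lexicon each round keeps a single bad pattern from permanently polluting the category.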
Meta-Bootstrapping

[Diagram: seed words initialize the permanent semantic lexicon; candidate extraction patterns & extractions feed mutual bootstrapping (select best EP, add best EP's extractions to the temporary semantic lexicon / extraction phrase list); after each inner run, the 5 best NPs are added to the permanent semantic lexicon.]
Results

• Seed words (terrorist locations): bolivia, city, colombia, …
• Location patterns extracted by meta-bootstrapping after 50 iterations:
  – Kidnapped in <X>
  – Taken in <X>
  – Operates in <X>
  – Billion in <X>
• 76% of hypothesized location phrases were true locations
Related Work

• The DIPRE algorithm of Brin (1998) uses bootstrapping to extract (title, author) pairs for books on the WWW.
• Yarowsky (1995) used a bootstrapping algorithm for a word sense disambiguation task.
• Nigam (1999) used a few labeled documents instead of keywords.
References

• Bootstrapping for Text Learning Tasks. (1999) Jones, R., McCallum, A., Nigam, K., and Riloff, E.
• Learning Dictionaries for Information Extraction by Multi-Level Bootstrapping. (1999) Riloff, E. and Jones, R.
• Foundations of Statistical Natural Language Processing. Manning and Schütze.