Transcript Slides

Wrapper Induction & Other Use
of “Structure”
2-1-2007
Outline
•
•
•
•
Deadlines and projects and such
What wrapper learning is about
BWI and some related work
MUC7: MENE & others
Administrative issues
• Don’t forget – critiques due on Tues!
– “1/22: Paper critiques for the week should be submitted on Tuesday before
class. E.g., summaries for the four papers discussed the week of Jan 30th
(Janche and Abney, Cohen et al, Freitag & Kushmerick, Borthwick et al) should
be submitted before class Jan 30th. Submit your critiques by email (to
vitor@cs, cc wcohen@cs). Each paper critique should be about 200-500 words
(think half page or a page) discussing one thing you liked and/or one thing you
didn’t like about the paper.” [course web page]
– Please use “Subject: critique ….”
– Next week: Borkar “Datamold” paper, Frietag et al MEMM paper,
Ratnaparkhi Maxent tagger paper
• Grading:
– Grades based on project content, project write-up, class participation
and critiques
– You’ll learn more if you read and think in advance of the lectures
Administrative issues
• Some scary math:
– 23 sessions left (after this week)
– 28 people enrolled  28 student “optional” paper presentations
•
My plan: next 14-15 sessions will each have
– two 20min student presentations
– discussion/questions
– 50-60min of lecture
•
Vitor and I will construct a signup mechanism (Google spreadsheet?) and
email you the details and post on the web page
– Procrastinate intelligently
– If you don’t get email contact Vitor (we have Andrew emails for roster)
•
What you can present
– Any “optional” paper from the syllabus
– Anything else you want that’s related (check with William)
•
When you can present
– Any time after the topic has been covered in lecture
– Preferably soon after
Projects
• Next Tuesday 2/6: everyone submit an abstract (email to
William cc Vitor, and hardcopy)
– One page, covering some subset of:
•
•
•
•
•
•
•
What you plan to do
Why you think it’s interesting
Any relevant superpowers you might have
How you plan to evaluate
What techniques you plan to use
What question you want to answer
Who you might work with
– These will be posted on the class web site
• Following Tuesday 2/13:
– Similar abstract from each team
• Team is (preferably) 2-3 people, but I’m flexible
• Main new information: who’s on what team
Previous projects
•
•
•
•
“Extracting person names from email, augmented by name co-reference
information” – extension in EMNLP 2005 with 2/3 of team members
Learning to Exract Signature and Reply Lines from Email – 1 person,
extension appeared in CEAS 2004
“Evaluating CRF and MaxEnt Methods with Language-Model Derived
Features for Protein Name Extraction” – 2 students, one followed up with
published work in this area
“Filtering Web Searching Results on People” – 2 students
– paper with same idea & comparable results from different people has since
appeared in ??
•
“Semantic Role Labeling with Conditional Random Fields” – 2 students,
– one left CMU and went to Univ of Rochester
•
•
BiDirectional MEMM-like model (evaluated on secondary structure
prediction, compared to MaxEnt, CRFs) – 2 people
“Comparing SVM, HMM, MaxEnt and CRF” in Video Segmentation” – 2
students – negative results but a good grade
Some other ideas…
• Exploiting burstiness in named entities on the
web.
Some other ideas…
• Exploiting burstiness in named entities on the
web – I have some resources to help
Some other ideas
• Semi-supervised relation extraction for biology
– Large subdomains where entity extraction is easy
(e.g., extraction of yeast proteins) but relation
extraction is hard (e.g., determining if “A inhibits B” or
if “A binds to B”).
– This gives you unlabeled examples of interactions
– There are also small amounts of labeled interaction
data
– There are also large amounts of “weakly labeled” data
• For many proteins we know (e.g.) how A and B actually do
interact.
Other project ideas
Project: Semi-supervised learning of named entity extractors from the web.
The goal of this project is to develop a cotraining-style approach to learning named
entity extractors using the web plus a handful of seed examples (e.g., to train an
extractor for the class “university” we might provide names of half a dozen
universities). The key to cotraining is to identify multiple redundant views of the data,
and corresponding classifiers for each view, that can be used to cotrain each
other. For example, three possible views for classifying named entities are (1) HTML
web page structure such as “if X appears as an item in an HTML list, and the other
items in the list a person names, then X is probably a person name,” (2) features of the
entity tokens such as “Capitalized tokens that end with ‘ski’ are likely to be last names
of people”, and (3) information in URLS such as “URLS of the form ‘www.<XXXX>.edu’
suggest that <XXXX> is a university name or an abbreviation of one. The first step of
this project will be to identify an interesting set of views such as the above, which
complement those we are already developing. The second step will be to implement a
cotraining method to learn the different types of classifiers jointly from mostly
unlabeled data.
Tom Mitchell has volunteered to work with the student team on this project. The project
would fit into the larger “Read the Web” research project, for which Google has
provided special permission to make a large number of queries to their web search
engine. Please contact [email protected] if you would like to discuss this.
Other project ideas
Project: Joint learning of named-entity extractors and relation extractors.
The goal of this project is to develop more accurate learning methods for training
named-entity extractors and relation extractors, by coupling these two learning
problems into a single joint problem. Consider, for example, learning named entity
extractors for ‘person(x)’ and for ‘university(x)’ and also learning relation extractors for
the relations advisor_of(x,y) and affiliated_with(a,b). One could treat these as
independent learning tasks. However, they can be coupled because the relations are
typed: for advisor_of(x,y) to be true, it must also be true that person(x) and
person(y). Let’s design a learning method that learns these jointly, resulting in better
learning from less total labeled data. A possible extension to this is to consider
methods that automatically find useful subclasses of the named entities, where
“useful” means that learning this subclass helps with relation extraction. For example,
a potentially useful subclass of ‘person’ is ‘student,’ because this is a more useful class
for y in extracting instances of the advisor_of(x,y) relation.
Tom Mitchell has volunteered to work with the team on this project. The project would
fit into the larger “Read the Web” research project, for which Google has provided
special permission to make a large number of queries to their web search
engine. Please contact [email protected] if you would like to discuss this.
Outline
•
•
•
•
Deadlines and projects and such
What wrapper learning is about
BWI and some related work
MUC7 and MENE
Managing information from many
places
Data Integration Problems:
1. Heterogeneous databases have
different schemas (which the user
or integrator must understand)
X:advisor(wc,Y)&affil(X,lti) ?
Query
{X=em; X=vc}
Answer
inference
2. Heterogeneous databases have
different object identifiers.
advisor(wc,vc)
advisor(yh,tm)
affil(wc,mld)
from(vc,lti)
affil(vc,lti)
fn(wc,``William”)
first(vc,``Vitor”)
fn(vc,``Vitor”)
AND
Wrappers
Goal: learn from a
human teacher how
to extract certain
database records
from a particular
web site.
Learner
Why learning from few examples is
important
At training time, only four
examples are available—but
one would like to generalize
to future pages as well…
Must generalize across time as well as across a single site
Some history
Kushmerick’s WIEN system
• Earliest wrapper-learning system
(published IJCAI ’97)
• Special things about WIEN:
– Treats document as a string of characters
– Learns to extract a relation directly, rather
than extracting fields, then associating them
together in some way
– Example is a completely labeled page
WIEN system: a sample wrapper
WIEN system: a sample wrapper
Left delimiters L1=“<B>”, L2=“<I>”; Right R1=“</B>”, R2=“</I>”
WIEN system: a sample wrapper
Learning means finding L1,…,Lk and R1,…,Rk
• Li must precede every instance of field i
• Ri must follow every instance of field I
• Li, Ri can’t contain data items
• Limited number of possible candidates for Li,Ri
WIEN system: a more complex
class of wrappers (HLRT)
H = “<P>”, T = “<HR>”
Extension: use
Li,Ri delimiters
only: after a
“head” (after first
occurence of H)
and before a “tail”
(occurrence of T)
Kushmeric: overview of various
extensions to LR
Wrapper Induction History
• Kushmerick ’97
• STALKER ’98-’01
–  Commercialization as
FETCH.com ~= 2000
– Rivals: Connotate, etc
• Semisupervised tree-structure
learning (2002,2004,2005)
Kushmeric and Frietag: Boosted
wrapper induction
Review of boosting
Generalized version of AdaBoost (Singer&Schapire, 99)
Allows “real-valued” predictions for each “base
hypothesis”—including value of zero.
Learning methods: boosting rules
Weak learner: to find weak hypothesis t:
1. Split Data into Growing and Pruning sets
2. Let Rt be an empty conjunction
3. Greedily add conditions to Rt guided by Growing set:
Constraint: W+ > W4. Greedily remove conditions from Rt guided by Pruning set:
5. Convert to weak hypothesis:
where
and caret is smoothing
Learning methods: boosting rules
SLIPPER also produces fairly compact rule sets.
Learning methods: BWI
• Boosted wrapper induction (BWI) learns to
extract substrings from a document.
– Learns three concepts: firstToken(x), lastToken(x),
substringLength(k)
– Conditions are tests on tokens before/after x
• E.g., toki-2=‘from’, isNumber(toki+1)
– SLIPPER weak learner, no pruning.
– Greedy search extends “window size” by at most L
in each iteration, uses lookahead L, no fixed limit
on window size.
• Good results in (Kushmeric and Frietag, 2000)
BWI algorithm
BWI algorithm
Lookahead
search here
BWI example rules
BWI and SWI
• Kautchak, Smarr & Elkan, 2004: “Sources
of Success for Information Extraction
Methods”, Journal of Machine Learning
Research, 5, 499 - 527.
– SWI is a “hard” version of BWI, using setcovering instead of boosting
Websites from DBs
Seminars, Job Ads
BioEntities in
Medline Abstracts
Secondary
regularities?
Same
#rounds as
SWI
BWI rule
wts
No
negative
examples
covered
= (#rules)/(#posExamples)
Cohen et al
Improving A Page Classifier
with Anchor Extraction
and Link Analysis
William W. Cohen
NIPS 2002
•Previous work in page classification using links:
• Exploit hyperlinks (Slattery&Mitchell 2000;
Cohn&Hofmann, 2001; Joachims 2001):
Documents pointed to by the same “hub” should
have the same class.
• What’s new in this paper:
• Use structure of hub pages (as well as structure
of site graph) to find better “hubs”
• Adapt an existing “wrapper learning” system to
find structure, on the task of classifying “executive
bio pages”.
Intuition: links from this
“hub page” are informative…
…especially these links
Task: train a page classifier,
then use it to classify pages on
a new, previously-unseen web
site as executiveBio or other
Question: can index pages for
executive biographies be used
to improve classification?
Idea: use the wrapper-learner to
learn to extract links to
execBio pages, smoothing the
“noisy” data produced by the
initial page classifier.
Background: “co-training” (Mitchell&Blum,
‘98)
• Suppose examples are of the form (x1,x2,y) where x1,x2
are independent (given y), and where each xi is
sufficient for classification, and unlabeled examples
are cheap.
– (E.g., x1 = bag of words, x2 = bag of links).


• Co-training algorithm:
1. Use x1’s (on labeled data D) to train f1(x)=y
2. Use f1 to label additional unlabeled examples U.
3. Use x2’s (on labeled part of U+D to train f1(x)=y
4. Repeat . . .
Simple 1-step co-training for web
pages
f1 is a bag-of-words page classifier, and S is web site
containing unlabeled pages.
• Feature construction. Represent a page x in S as a
bag of pages that link to x (“bag of hubs”).
• Learning. Learn f2 from the bag-of-hubs examples,
labeled with f1
• Labeling. Use f2(x) to label pages from S.
Idea: use one round of co-training to bootstrap the bag-of words
classifier to one that uses site-specific features x2/f2
Improved 1-step co-training for web
pages
Feature construction.
- Label an anchor a in S as positive iff it points to a positive
page x (according to f1). Let D = {(x’,a): a is a positive
anchor on x’}.
- Generate many small training sets Di from D, by sliding
small windows over D.
- Let P be the set of all “structures” found by any builder
from any subset Di
- Say that p links to x if p extracts an anchor that points to x.
Represent a page x as the bag of structures in P that link
to x.
Learning and Labeling. As before.
List1
builder
extractor
builder
extractor
List2
builder
extractor
List3
BOH representation:
{ List1, List3,…}, PR
{ List1, List2, List3,…}, PR
{ List2, List 3,…}, Other
{ List2, List3,…}, PR
…
Learner
Experimental results
0.25
0.2
0.15
Winnow
0.1
D-Tree
None
0.05
0
1
None
2
3
Co-training hurts
4
5
6
7
Winnow
8
9
No improvement
Experimental results
Summary
- “Builders” (from a wrapper learning system) let one
discover and use structure of web sites and
index pages to smooth page classification results.
- Discovering good “hub structures” makes it
possible to use 1-step co-training on small (50200 example) unlabeled datasets.
– Average error rate was reduced from 8.4% to
3.6%.
– Difference is statistically significant with a 2tailed paired sign test or t-test.
– EM with probabilistic learners also works—see
(Blei et al, UAI 2002)
MUC-7
• Last Message Understanding Conference,
(fore-runner to ACE), about 1998
– 200 articles in development set (aircraft
accidents)
– 200 articles in final test (launch events)
– Names of persons, organizations, locations,
dates, times, currency & percentage
LTG
NetOwl
Commercial RBS
Identifinder (HMMs)
MENE+Proteus
Manitoba
(NB filtered names)
Borthwick et al: MENE system
• Much like MXPost, with some tricks for NER:
– 4 tags/field: x_start, x_continue, x_end, x_unique
– Features:
•
•
•
•
•
Section features
Tokens in window
Lexical features of tokens in window
Dictionary features of tokens (is token a firstName?)
External system of tokens (is this a NetOwl_company_start?
proteus_person_unique?)
• Smooth by discarding low-count features
– No history: viterbi search used to find best consistent
tag sequence (e.g. no continue w/o start)
Dictionaries in MENE
MENE results (dry run)
MENE learning curves
92.2
93.3
96.3
• Largest U.S. Cable Operator Makes Bid for Walt Disney
•
By ANDREW ROSS SORKIN
Longer names
• The Comcast Corporation, the largest cable television
operator in the United States, made a $54.1 billion unsolicited
takeover bid today for The Walt Disney Company, the storied
family entertainment colossus.
• If successful, Comcast's audacious bid would once again
reshape the entertainment landscape, creating a new media
behemoth that would combine the power of Comcast's powerful
distribution channels to some 21 million subscribers in the
nation with Disney's vast library of content and production
assets. Those include its ABC television network, ESPN and
other cable networks, and the Disney and Miramax movie
studios.
Short names
LTG system
• Another MUC-7 competitor
• Handcoded rules for “easy” cases (amounts, etc)
• Process of repeated tagging and “matching” for hard
cases
– Sure-fire (high precision) rules for names where type is clear
(“Phillip Morris, Inc – The Walt Disney Company”)
– Partial matches to sure-fire rule are filtered with a maxent
classifier (candidate filtering) using contextual information, etc
– Higher-recall rules, avoiding conflicts with partial-match output
“Phillip Morris announced today…. - “Disney’s ….”
– Final partial-match & filter step on titles with different learned
filter.
• Exploits discourse/context information
LTG Results
LTG
NetOwl
Commercial RBS
Identifinder (HMMs)
MENE+Proteus
Manitoba
(NB filtered names)