Confidence-Weighted Linear Classification
Mark Dredze, Koby Crammer (University of Pennsylvania)
Fernando Pereira (Penn / Google)
Natural Language Processing
• Big datasets, large number of features
• Many features are only weakly correlated with the target label
• Heavy-tailed feature distribution
• Linear classifiers: features are associated with word counts
[Figure: feature count vs. feature rank, showing the heavy-tailed feature distribution]
Sentiment Classification
• Who needs this Simpsons book? You DOOOOOOOO. This is one of the most extraordinary volumes I've ever encountered encapsulating a television series … . Exhaustive, informative, and ridiculously entertaining, it is the best accompaniment to the best television show … . Even if you only "enjoy" the Simpsons (as opposed to being a raving fanatic, which most people who watch the show are, after all … Very highly recommended!
Sentiment Classification
• Many positive reviews with the word best increase w_best
• A later negative review:
  – "boring book – best if you want to sleep in seconds"
• A linear update will reduce both w_best and w_boring
• But best appeared far more often than boring
• How can we adjust different weights at different rates, so that w_boring changes more than w_best?
Linear Classifiers
• Prediction: ŷ = sign(w · x) (see the sketch below)
  – x: the instance to be classified
  – w: the weight vector
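As a concrete, illustrative example (not from the slides) of such a classifier over word counts, here is a minimal sketch; representing the instance as a plain dict of counts is an assumption made for readability.

```python
def predict(weights, features):
    """Linear classifier: sign of the inner product between the weight
    vector and the instance's (sparse) word-count features."""
    score = sum(weights.get(f, 0.0) * value for f, value in features.items())
    return 1 if score >= 0 else -1

# Illustrative weights and counts for a short review.
w = {"best": 1.2, "boring": -0.8}
print(predict(w, {"best": 2, "boring": 1}))   # -> 1 (positive)
```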
Online Learning
• Memory-efficient, simple to implement, often competitive
• Successive rounds (see the skeleton below):
  – Get an input instance
  – Output a prediction
  – Receive a feedback label
  – Compute the loss
  – Update the prediction rule
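A generic skeleton of these rounds, as a hedged sketch; the `stream` and `update` arguments are illustrative placeholders rather than names from the slides.

```python
import numpy as np

def online_learn(stream, update, num_features):
    """Generic online-learning loop: predict, receive feedback, compute loss, update.
    `stream` yields (x, y) pairs; `update` is any update rule, e.g. the
    perceptron or passive-aggressive rules sketched later in the deck."""
    w = np.zeros(num_features)
    mistakes = 0
    cumulative_loss = 0.0
    for x, y in stream:                           # get an input instance
        y_hat = 1 if np.dot(w, x) >= 0 else -1    # output a prediction
        mistakes += int(y_hat != y)               # receive the feedback label
        cumulative_loss += max(0.0, 1.0 - y * np.dot(w, x))  # compute (hinge) loss
        w = update(w, x, y)                       # update the prediction rule
    return w, mistakes, cumulative_loss
```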
Span-based Update Rules
• The weight vector is a linear combination of the examples
• Generic update: w_f ← w_f + α · y · x_f
  – w_f: weight of feature f
  – α: learning rate
  – y: target label, -1 or 1
  – x_f: value of feature f of the instance
• Two rate schedules (among others), sketched below:
  – Perceptron algorithm, conservative: fixed learning rate, update only on mistakes
  – Passive-aggressive: rate just large enough to correctly classify the current example
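A minimal sketch of the two rate schedules, using the standard perceptron and (hard-margin) passive-aggressive formulas:

```python
import numpy as np

def perceptron_update(w, x, y):
    """Perceptron: conservative update with learning rate 1, only on mistakes."""
    if y * np.dot(w, x) <= 0:          # prediction was wrong (or on the boundary)
        w = w + y * x                   # move the weights toward the example
    return w

def passive_aggressive_update(w, x, y):
    """Passive-aggressive: rate chosen so the example is classified with margin 1."""
    loss = max(0.0, 1.0 - y * np.dot(w, x))   # hinge loss on this example
    if loss > 0.0:
        alpha = loss / np.dot(x, x)            # PA learning rate
        w = w + alpha * y * x
    return w
```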
Distributions in Version Space
[Figure: a Gaussian distribution over weight vectors in version space, with the mean weight vector and an example marked]
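In symbols, the learner maintains a Gaussian over weight vectors, with mean μ and covariance Σ (the notation used in the remaining slides):

\[
\mathbf{w} \;\sim\; \mathcal{N}(\boldsymbol{\mu}, \Sigma)
\]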
Margin as a Random Variable
• The signed margin y (w · x) is a Gaussian-distributed random variable
• Thus its mean and variance follow directly from the distribution over w (see below)
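Making the "thus" explicit (a standard fact: a linear function of a Gaussian vector is Gaussian):

\[
M \;=\; y\,(\mathbf{w}\cdot\mathbf{x}), \quad \mathbf{w}\sim\mathcal{N}(\boldsymbol{\mu},\Sigma)
\;\;\Longrightarrow\;\;
M \;\sim\; \mathcal{N}\!\big(\,y\,(\boldsymbol{\mu}\cdot\mathbf{x}),\;\; \mathbf{x}^{\top}\Sigma\,\mathbf{x}\,\big)
\]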
Weight Vector (Version) Space
Place most of the probability mass in the region of weight vectors that classify the example correctly (formalized below)
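With the Gaussian margin above, "most of the probability mass" becomes a constraint on the probability of correct classification; here η is the confidence parameter introduced on the PA-like update slide and Φ is the standard normal CDF:

\[
\Pr_{\mathbf{w}\sim\mathcal{N}(\boldsymbol{\mu},\Sigma)}\!\big[\, y\,(\mathbf{w}\cdot\mathbf{x}) \ge 0 \,\big] \;\ge\; \eta
\quad\Longleftrightarrow\quad
y\,(\boldsymbol{\mu}\cdot\mathbf{x}) \;\ge\; \Phi^{-1}(\eta)\,\sqrt{\mathbf{x}^{\top}\Sigma\,\mathbf{x}}
\]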
Passive Step
Nothing to do: most weight vectors already classify the example correctly.
Aggressive Step
• Project the current Gaussian distribution onto the half-space of weight vectors that classify the new example correctly
• The mean is moved past the mistake line (large margin)
• The covariance is shrunk in the direction of the new example
PA-like Update
• PA: move the weight vector as little as possible, subject to classifying the current example with a margin
• New update: move the Gaussian (mean and covariance) as little as possible, subject to classifying the current example correctly with probability at least η, the confidence parameter (see below)
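The two optimization problems side by side, as I read them from the slide (the PA form is the standard one; the CW form uses the confidence constraint from the version-space slide):

\[
\text{PA:}\qquad
\mathbf{w}_{t+1} \;=\; \arg\min_{\mathbf{w}} \tfrac{1}{2}\lVert \mathbf{w}-\mathbf{w}_t \rVert^2
\quad\text{s.t.}\quad y_t\,(\mathbf{w}\cdot\mathbf{x}_t) \;\ge\; 1
\]
\[
\text{CW:}\qquad
(\boldsymbol{\mu}_{t+1},\Sigma_{t+1}) \;=\; \arg\min_{\boldsymbol{\mu},\Sigma}\;
\mathrm{D_{KL}}\!\big(\mathcal{N}(\boldsymbol{\mu},\Sigma)\,\big\|\,\mathcal{N}(\boldsymbol{\mu}_t,\Sigma_t)\big)
\quad\text{s.t.}\quad
\Pr_{\mathbf{w}\sim\mathcal{N}(\boldsymbol{\mu},\Sigma)}\!\big[\,y_t\,(\mathbf{w}\cdot\mathbf{x}_t)\ge 0\,\big] \;\ge\; \eta
\]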
The Optimization Problem
• Objective: KL divergence to the previous Gaussian (convex)
• Such that: the example is classified correctly with probability at least η
• Where the constraint involves the standard deviation √(xᵀ Σ x): not convex!
Simplified Optimization Problem
• Replace the standard deviation in the constraint with the variance xᵀ Σ x: convex! (see below)
• Exact variance method
• In high dimensions, restrict the covariance to the diagonal
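The simplified constraint in symbols, as I read the "exact variance method": the variance term is linear in Σ, so the constraint set becomes convex.

\[
y_t\,(\boldsymbol{\mu}\cdot\mathbf{x}_t) \;\ge\; \phi\;\mathbf{x}_t^{\top}\Sigma\,\mathbf{x}_t,
\qquad \phi \;=\; \Phi^{-1}(\eta)
\]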
Approximate Diagonal Algorithm
• Approximate by projecting the covariance onto the diagonal
• Closed-form solution: a variable, per-feature learning rate (sketched below)
• With binary features, the update behaves like scaled counts
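A minimal sketch of one such update, assuming the variance form of the constraint and a diagonal covariance; the closed-form multiplier α below is derived from that constraint and should be read as illustrative rather than as the paper's exact expression.

```python
import numpy as np
from scipy.stats import norm

def cw_diag_update(mu, sigma, x, y, eta=0.9):
    """One confidence-weighted update with a diagonal covariance (variance form).

    mu, sigma : per-feature mean weights and (positive) variances, 1-D arrays
    x         : feature vector of the instance
    y         : label, -1 or +1
    eta       : required probability of correct classification (> 0.5)
    """
    phi = norm.ppf(eta)                     # Phi^{-1}(eta)
    M = y * np.dot(mu, x)                   # mean of the signed margin
    V = np.dot(sigma, x * x)                # margin variance: x^T diag(sigma) x
    if V == 0.0 or M >= phi * V:            # passive step: constraint already holds
        return mu, sigma
    # Closed-form Lagrange multiplier for the variance-form constraint.
    a = 1.0 + 2.0 * phi * M
    alpha = (-a + np.sqrt(a * a - 8.0 * phi * (M - phi * V))) / (4.0 * phi * V)
    alpha = max(alpha, 0.0)
    mu = mu + alpha * y * sigma * x                          # per-feature rates = variances
    sigma = 1.0 / (1.0 / sigma + 2.0 * alpha * phi * x * x)  # shrink variance of seen features
    return mu, sigma
```

With binary features (x_j ∈ {0, 1}), the variance update adds 2αφ to 1/σ_j only for features present in the example, so frequently seen features end up with small variances and hence small learning rates, which is the "scaled counts" behavior the slide alludes to.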
Visualizing Learning Rates
[Figure: covariance ellipses after 30 rounds; left panel: full covariance, right panel: diagonal covariance]
• 20 features, only two informative
• Covariance ellipses (20 x cov)
  – Black: the two informative features
  – Blue: several pairs of noise features
  – Green: target weight vector
Visualizing Learning Rates
[Figure: covariance ellipses after 90 rounds; full covariance (2 x) vs. diagonal covariance (20 x)]
Experiments
• Online to batch:
  – Multiple passes over the training data
  – Evaluate on a separate test set after each pass
  – Compute error/accuracy
• Binary problems from:
  – Newsgroups
  – Reuters
• Sentiment classification
Data
• Sentiment
  – Sentiment reviews from 6 Amazon domains (Blitzer et al.)
  – Classify a product review as either positive or negative
• Reuters, pairs of labels
  – Three divisions:
    • Insurance: Life vs. Non-Life; Business Services: Banking vs. Financial; Retail Distribution: Specialist Stores vs. Mixed Retail
  – Bag-of-words representation with binary features
• 20 Newsgroups, pairs of labels
  – Three divisions:
    • comp.sys.ibm.pc.hardware vs. comp.sys.mac.hardware, sci.electronics vs. sci.med, and talk.politics.guns vs. talk.politics.mideast
  – Bag-of-words representation with binary features
Typical Performance (20 NG)
Summary Statistics
Parallel Training
• Split large data into disjoint sets
• Train on each set independently
• Combine the resulting classifiers (see the sketch below):
  – Average: average the performance of the individual classifiers
  – Uniform: mean of the weight vectors
  – Weighted: mean of the weight vectors, weighted using confidence information
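A hedged sketch of the two combination rules; the slides do not spell out the weighting scheme, so the inverse-variance weighting below is an assumption, chosen because it lets high-confidence (low-variance) feature weights dominate.

```python
import numpy as np

def combine_uniform(mus):
    """Uniform combination: plain mean of the per-split weight vectors."""
    return np.mean(mus, axis=0)

def combine_confidence_weighted(mus, sigmas):
    """Confidence-weighted combination (illustrative): weight each split's
    feature weights by their inverse variances, so low-variance
    (high-confidence) estimates dominate the combined model."""
    mus = np.asarray(mus)                    # shape: (num_splits, num_features)
    precisions = 1.0 / np.asarray(sigmas)    # inverse variances, same shape
    return (precisions * mus).sum(axis=0) / precisions.sum(axis=0)
```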
Parallel Training
• Sentiment ~1M; Reuters ~0.8M
• #Features / #Docs: Sentiment ~13; Reuters ~0.35
• Performance degrades with the number of splits
• Weighting improves performance
Summary
• Online learning is fast and effective …
  … but NLP data has skewed feature distributions
• Confidence-weighted linear prediction explicitly represents how much to trust each feature weight
  – Higher accuracy than previous methods
  – Faster convergence
  – Effective model combination
• Current work:
  – Direct convex version of the original optimization
  – Batch algorithm
  – Explore full-covariance versions