NYU-CSx - Panos Ipeirotis


Get Another Label?
Improving Data Quality and Machine Learning Using Multiple, Noisy Labelers

Panos Ipeirotis, New York University
Joint work with Jing Wang, Foster Provost, and Victor Sheng
Outsourcing machine learning preprocessing

Traditionally, modeling teams have invested substantial internal resources in data formulation, information extraction, cleaning, and other preprocessing.

Now, we can outsource preprocessing tasks such as labeling, feature extraction, verifying information extraction, etc.
– using Mechanical Turk, oDesk, etc.
– quality may be lower than expert labeling (perhaps much lower)
– but low costs can allow massive scale
Example: Build an “Adult Web Site” Classifier

Need a large number of hand-labeled sites. Get people to look at sites and classify them as:
G (general audience), PG (parental guidance), R (restricted), X (porn)

Cost/Speed Statistics
– Undergrad intern: 200 websites/hr at $15/hr (about 7.5 cents per site)
– MTurk: 2,500 websites/hr at $12/hr (about 0.5 cents per site)
Noisy labels can be problematic

Many tasks rely on high-quality labels for objects:
– webpage classification for safe advertising
– learning predictive models
– searching for relevant information
– finding duplicate database records
– image recognition/labeling
– song categorization
– sentiment analysis

Noisy labels can lead to degraded task performance.
Quality and Classification Performance
Labeling quality increases → classification quality increases.

[Figure: AUC vs. number of examples on the “Mushroom” data set, one learning curve per single-labeler quality P = 50%, 60%, 80%, 100%, where P is the probability of assigning a binary label correctly. Higher-quality labels yield uniformly better classifiers.]
Solutions

– Get better labelers: often beyond our control or too expensive.
– Get more labelers per item: our focus.
Majority Voting and Label Quality

Ask multiple labelers and keep the majority label as the “true” label. Quality is the probability that this integrated label is correct.

[Figure: integrated quality vs. number of labelers (1 to 13), one curve per individual labeler quality P = 0.4, 0.5, ..., 1.0, where P is the probability of an individual labeler being correct. With P > 0.5 quality improves as labelers are added; with P = 0.5 it stays at 0.5; with P < 0.5 it degrades.]
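The integrated quality in the figure is just the probability that a majority of n independent labelers, each correct with probability P, produces the correct binary label. A minimal sketch (assuming independent labelers and random tie-breaking, which is our own simplification):

```python
from math import comb

def integrated_quality(p: float, n: int) -> float:
    """Probability that the majority vote of n independent labelers,
    each correct with probability p, is the correct binary label.
    Ties (possible only for even n) are broken by a coin flip."""
    q = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n // 2 + 1, n + 1))
    if n % 2 == 0:
        q += 0.5 * comb(n, n // 2) * (p * (1 - p))**(n // 2)
    return q

# Reproduces the shape of the curves: quality rises with more labelers when
# p > 0.5, stays at 0.5 when p = 0.5, and actually degrades when p < 0.5.
for p in (0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0):
    print(p, [round(integrated_quality(p, n), 3) for n in (1, 3, 5, 7, 9, 11, 13)])
```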
Tradeoffs for Modeling


– Get more examples → improve classification
– Get more labels → improve label quality → improve classification

[Figure: accuracy vs. number of examples on the Mushroom data set, one learning curve per labeler quality P = 0.5, 0.6, 0.8, 1.0.]
Basic Labeling Strategies


Single Labeling
– Get as many data points as possible, one label each.

Round-robin Repeated Labeling
– Repeatedly label data points, giving the next label to the one with the fewest labels so far (see the sketch below).
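A minimal sketch of the two strategies under a fixed labeling budget; `get_label` is a stand-in of ours (not from the talk) for whatever noisy labeling source is used:

```python
from collections import defaultdict

def single_labeling(examples, budget, get_label):
    """Spend the budget on as many distinct examples as possible, one label each."""
    return {x: [get_label(x)] for x in examples[:budget]}

def round_robin_labeling(examples, budget, get_label):
    """Fixed round-robin repeated labeling: keep a fixed set of examples and
    always give the next label to the example with the fewest labels so far."""
    labels = defaultdict(list)
    for _ in range(budget):
        x = min(examples, key=lambda e: len(labels[e]))
        labels[x].append(get_label(x))
    return dict(labels)
```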
Round Robin vs. Single: Low Noise

[Figure: learning curves for single labeling (SL) and fixed round-robin (FRR) with 50 examples and labeling quality p = 0.8.]

With low noise and few examples, more (singly labeled) examples are better.
Round Robin vs. Single: High Noise

[Figure: learning curves for FRR (100 examples) and SL with labeling quality p = 0.6.]

With high noise and many examples, repeated labeling is better.
Selective Repeated-Labeling

We have seen so far:
– With enough examples and noisy labels, getting multiple labels is better than single-labeling.
– When we also account for the cost of acquiring the unlabeled items, the benefit is magnified.

Can we do better than the basic strategies?

Key observation: we have additional information to guide the selection of data for repeated labeling: the current multiset of labels.

Example: {+,-,+,-,-,+} vs. {+,+,+,+,+,+}
Natural Candidate: Entropy

Entropy is a natural measure of label uncertainty:

E(S) = -\frac{|S^{+}|}{|S|}\log_{2}\frac{|S^{+}|}{|S|} - \frac{|S^{-}|}{|S|}\log_{2}\frac{|S^{-}|}{|S|}

where |S^{+}| is the number of positive labels and |S^{-}| the number of negative labels in the multiset S.

E({+,+,+,+,+,+}) = 0
E({+,-,+,-,-,+}) = 1

Strategy: get more labels for high-entropy label multisets.
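A quick sketch of the entropy score for a binary label multiset; it reproduces the two values above:

```python
from math import log2

def label_entropy(labels):
    """Entropy of a binary label multiset, e.g. ['+', '-', '+', '-', '-', '+']."""
    frac_pos = labels.count('+') / len(labels)
    return sum(-p * log2(p) for p in (frac_pos, 1 - frac_pos) if p > 0)

print(label_entropy(['+'] * 6))                        # 0.0
print(label_entropy(['+', '-', '+', '-', '-', '+']))   # 1.0
print(label_entropy(['+'] * 3 + ['-'] * 2))            # ~0.97, same as {600+, 400-}
```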
What Not to Do: Use Entropy

[Figure: labeling quality vs. number of labels (mushroom, p = 0.6), comparing the round-robin baseline (GRR) with entropy-based selection. Entropy-based selection improves quality at first but hurts in the long run.]
Why not Entropy

– In the presence of noise, entropy stays high even with many labels.
– Entropy is scale invariant: {3+, 2-} has the same entropy as {600+, 400-}, i.e., 0.97.
Estimating Label Uncertainty (LU)

– Observe the +’s and -’s and compute Pr{+|obs} and Pr{-|obs}.
– Use a Bayesian estimate of the Bernoulli parameter: Beta prior + update.
– Uncertainty = tail of the Beta distribution on the losing side of the 0.5 threshold (S_LU).

[Figure: Beta probability density function over [0, 1]; S_LU is the tail mass of the posterior.]
Label Uncertainty: three examples (labeler quality p = 0.7)

– 5 labels (3+, 2-): entropy ≈ 0.97, CDF_b = 0.34
– 10 labels (7+, 3-): entropy ≈ 0.88, CDF_b = 0.11
– 20 labels (14+, 6-): entropy ≈ 0.88, CDF_b = 0.04

Entropy barely moves, but the Beta-tail uncertainty (CDF_b) keeps shrinking as evidence accumulates.
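A small sketch of the Beta-tail score, assuming a uniform Beta(1,1) prior and SciPy; it reproduces the CDF values from the three examples above:

```python
from scipy.stats import beta

def label_uncertainty(pos: int, neg: int) -> float:
    """Posterior over the example's true +/- rate is Beta(1 + pos, 1 + neg)
    (uniform prior, updated with the observed labels). The uncertainty score
    is the tail mass on the losing side of the 0.5 decision threshold."""
    tail = beta(1 + pos, 1 + neg).cdf(0.5)
    return min(tail, 1 - tail)

print(round(label_uncertainty(3, 2), 2))    # 0.34  (5 labels)
print(round(label_uncertainty(7, 3), 2))    # 0.11  (10 labels)
print(round(label_uncertainty(14, 6), 2))   # 0.04  (20 labels)
```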
Label Uncertainty vs. Round Robin

[Figure: labeling quality vs. number of labels (mushroom, p = 0.6), comparing GRR and LU; similar results across a dozen data sets.]

Remember: GRR is already better than single labeling.
Why is LU a hack?
More sophisticated label uncertainty

– Observe the +’s and -’s and compute Pr{+|obs} and Pr{-|obs}.
– The estimated Beta distribution reflects the quality of the labelers for the example.
– Estimate the prior Pr(+) by iterating.

More sophisticated LU improves labeling quality under class imbalance and fixes some pesky LU learning-curve glitches. Both techniques perform essentially optimally with balanced classes.
Another strategy: Model Uncertainty (MU)

Learning models of the data provides an alternative source of information about label certainty (a random forest for the results to come).

Model uncertainty: get more labels for instances that cause model uncertainty.

Intuition?
– For modeling: why improve training-data quality where the model is already certain?
– For data quality: low-certainty “regions” may be due to incorrect labeling of the corresponding instances.

[Figure: examples feed models, and models flag suspect examples; a self-healing process (Brodley et al., 1999).]
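A rough sketch of a model-uncertainty score using scikit-learn's RandomForestClassifier (the slides mention a random forest; the exact scoring below is our simplification, not the paper's procedure): score each example by how close the model's predicted class probability is to 0.5, and route the most uncertain ones back for more labels.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def model_uncertainty_scores(X, y):
    """Train on the current (possibly noisy) integrated labels, then score each
    example: 1.0 = model maximally unsure, 0.0 = model certain."""
    model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
    proba = model.predict_proba(X)
    return 1.0 - 2.0 * np.abs(proba.max(axis=1) - 0.5)

# Ask for more labels on the examples with the highest scores.
```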
Yet another strategy: Label & Model Uncertainty (LMU)

Label and model uncertainty (LMU): avoid examples where either strategy is certain, by combining the two scores (geometric mean):

S_LMU = \sqrt{S_LU \cdot S_MU}
Label Quality

[Figure: labeling quality vs. number of labels (waveform, p = 0.6) for UNF (uniform, round robin), LU, MU, and LMU. Model uncertainty alone also improves quality; Label + Model Uncertainty does best.]

Across 12 domains, LMU is always better than GRR. LMU is statistically significantly better than LU and MU.
Model Quality

[Figure: accuracy vs. number of labels (sick, p = 0.6) for GRR, LU, MU, and LMU; Label & Model Uncertainty performs best.]
Adult content classification
Example: Build an “Adult Web Site” Classifier

Need a large number of hand-labeled sites. Get people to look at sites and classify them as:
G (general audience), PG (parental guidance), R (restricted), X (porn)

Cost/Speed Statistics
– Undergrad intern: 200 websites/hr, cost: $15/hr
– MTurk: 2,500 websites/hr, cost: $12/hr
Bad news: Spammers!

Worker ATAMRO447HWJQ labeled X (porn) sites as G (general audience).

Redundant votes, infer quality

Look at our spammer friend ATAMRO447HWJQ together with 9 other workers. Using redundancy, we can compute error rates for each worker.
Repeated Labeling, EM, and Confusion Matrices (Dawid & Skene, 1979)

Iterative process to estimate worker error rates:
1. Initialize the “correct” label for each object (e.g., using majority vote)
2. Estimate error rates for the workers (using the “correct” labels)
3. Estimate the “correct” labels (using the error rates, weighting worker votes according to quality)
4. Go to Step 2 and iterate until convergence
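A compact sketch of that iterative process (not the authors' released implementation); votes are assumed to be stored as one dict per object mapping worker id to an integer label:

```python
import numpy as np

def dawid_skene(votes, n_classes, n_iter=50):
    """Minimal Dawid & Skene-style EM.
    votes[i] is a dict {worker_id: label} for object i, labels in 0..n_classes-1.
    Returns (soft "correct" labels per object, confusion matrix per worker)."""
    workers = sorted({w for v in votes for w in v})
    widx = {w: j for j, w in enumerate(workers)}
    n_obj, n_wrk = len(votes), len(workers)

    # 1. Initialize "correct" labels with (soft) majority vote.
    T = np.zeros((n_obj, n_classes))
    for i, v in enumerate(votes):
        for w, l in v.items():
            T[i, l] += 1
        T[i] /= T[i].sum()

    for _ in range(n_iter):
        # 2. Estimate each worker's confusion matrix from the current labels.
        pi = np.full((n_wrk, n_classes, n_classes), 1e-6)   # light smoothing
        for i, v in enumerate(votes):
            for w, l in v.items():
                pi[widx[w], :, l] += T[i]
        pi /= pi.sum(axis=2, keepdims=True)

        # 3. Re-estimate the "correct" labels, weighting votes by worker quality.
        prior = T.mean(axis=0) + 1e-9
        for i, v in enumerate(votes):
            logp = np.log(prior)
            for w, l in v.items():
                logp += np.log(pi[widx[w], :, l])
            T[i] = np.exp(logp - logp.max())
            T[i] /= T[i].sum()
        # 4. Repeat (fixed iteration count stands in for a convergence check).
    return T, pi
```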
Error rates for ATAMRO447HWJQ
– P[G → G] = 99.947%   P[G → X] = 0.053%
– P[X → G] = 99.153%   P[X → X] = 0.847%

Our friend ATAMRO447HWJQ marked almost all sites as G. Seems like a spammer…
Challenge: From Confusion Matrices to Quality Scores

The algorithm generates “confusion matrices” for the workers.

Error rates for ATAMRO447HWJQ
– P[X → X] = 0.847%    P[X → G] = 99.153%
– P[G → X] = 0.053%    P[G → G] = 99.947%

How can we check whether a worker is a spammer using the confusion matrix?
(hint: the error rate alone is not enough)
Challenge 1: Spammers are lazy and smart!

Confusion matrix for a spammer:
– P[X → X] = 0%,  P[X → G] = 100%
– P[G → X] = 0%,  P[G → G] = 100%

Confusion matrix for a good worker:
– P[X → X] = 80%, P[X → G] = 20%
– P[G → X] = 20%, P[G → G] = 80%

Spammers figure out how to fly under the radar…

In reality, we have 85% G sites and 15% X sites:
– Error rate of spammer = 0% × 85% + 100% × 15% = 15%
– Error rate of good worker = 20% × 85% + 20% × 15% = 20%

False negatives: spam workers pass as legitimate (see the sketch below).
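The same arithmetic as a tiny sketch, showing why the raw error rate cannot separate the two:

```python
def overall_error_rate(confusion, priors):
    """Expected fraction of incorrect labels: sum over true classes of
    prior(true) * P[worker does not output the true class]."""
    return sum(priors[t] * (1 - confusion[t][t]) for t in priors)

priors  = {'G': 0.85, 'X': 0.15}
spammer = {'G': {'G': 1.0, 'X': 0.0}, 'X': {'G': 1.0, 'X': 0.0}}   # labels everything G
good    = {'G': {'G': 0.8, 'X': 0.2}, 'X': {'G': 0.2, 'X': 0.8}}
print(overall_error_rate(spammer, priors))   # 0.15 -- the spammer "looks better"
print(overall_error_rate(good, priors))      # 0.20 -- than the genuinely useful worker
```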
Challenge 2: Humans are biased!

Error rates for the CEO of AdSafe:
– P[G → G] = 20.0%   P[G → P] = 80.0%    P[G → R] = 0.0%     P[G → X] = 0.0%
– P[P → G] = 0.0%    P[P → P] = 0.0%     P[P → R] = 100.0%   P[P → X] = 0.0%
– P[R → G] = 0.0%    P[R → P] = 0.0%     P[R → R] = 100.0%   P[R → X] = 0.0%
– P[X → G] = 0.0%    P[X → P] = 0.0%     P[X → R] = 0.0%     P[X → X] = 100.0%

In reality, we have 85% G sites, 5% P sites, 5% R sites, 5% X sites:
– Error rate of spammer (everything marked G) = 0% × 85% + 100% × 15% = 15%
– Error rate of biased worker = 80% × 85% + 100% × 5% = 73%

False positives: legitimate workers appear to be spammers.
Solution: Reverse errors first, compute error rate afterwards

Take the biased worker’s confusion matrix from above and reverse the errors:
– When the biased worker says G, it is 100% G
– When the biased worker says P, it is 100% G
– When the biased worker says R, it is 50% P, 50% R
– When the biased worker says X, it is 100% X

Small ambiguity for the “R-rated” votes, but other than that, fine!
Solution: Reverse errors first, compute error rate afterwards

Error rates for spammer ATAMRO447HWJQ: every class is marked G, i.e., P[G → G] = P[P → G] = P[R → G] = P[X → G] = 100%, and all other entries are 0%.

Reversing the errors (assuming equal priors):
– When the spammer says G, it is 25% G, 25% P, 25% R, 25% X
– When the spammer says P, it is 25% G, 25% P, 25% R, 25% X
– When the spammer says R, it is 25% G, 25% P, 25% R, 25% X
– When the spammer says X, it is 25% G, 25% P, 25% R, 25% X

The results are highly ambiguous. No information provided!
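A minimal sketch of the reversal step: Bayes-invert the worker's confusion matrix to turn each vote into a "soft" label.

```python
def reverse_vote(confusion, priors, assigned):
    """P[true class = t | worker assigned `assigned`], by Bayes' rule:
    proportional to prior(t) * P[t -> assigned]."""
    weights = {t: priors[t] * confusion[t][assigned] for t in priors}
    total = sum(weights.values())
    return {t: w / total for t, w in weights.items()} if total else dict(priors)

classes = ['G', 'P', 'R', 'X']
priors = {c: 0.25 for c in classes}                     # equal priors, as in the slide
spammer = {t: {a: 1.0 if a == 'G' else 0.0 for a in classes} for t in classes}
print(reverse_vote(spammer, priors, 'G'))
# {'G': 0.25, 'P': 0.25, 'R': 0.25, 'X': 0.25} -- no information provided
```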
Computing Quality Scores

– Cost of a “soft” label <p1, p2, …, pL>, with cost c_ij for labeling as j an object of class i
– “Soft” labels with probability mass concentrated in a single class are good
– “Soft” labels with probability mass spread across classes are bad

Quality = 1 - Cost(worker) / Cost(spammer)
Cost(spammer) = replace each p_i with Prior_i
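A sketch of the quality score, assuming the expected cost of a soft label <p1, ..., pL> is the double sum Σ_i Σ_j p_i · p_j · c_ij (i.e., the soft label is eventually resolved into a hard label drawn from it); the exact cost definition here is our assumption about what the slide leaves implicit:

```python
def expected_cost(soft_label, costs):
    """Expected misclassification cost of a soft label: sum_i sum_j p_i * p_j * c_ij.
    Mass concentrated in one class -> low cost; mass spread out -> high cost."""
    return sum(soft_label[i] * soft_label[j] * costs[i][j]
               for i in soft_label for j in soft_label)

def quality_score(worker_soft_labels, priors, costs):
    """Quality = 1 - Cost(worker) / Cost(spammer), where the spammer's soft
    label is just the class priors. 1.0 = perfect, 0.0 = no better than spam."""
    worker_cost = sum(expected_cost(p, costs) for p in worker_soft_labels) / len(worker_soft_labels)
    return 1.0 - worker_cost / expected_cost(priors, costs)
```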
Experimental Results

– 500 web pages in G, P, R, X
– 100 workers per page (just to evaluate the effect of more labels)
– 339 workers
– Lots of noise! Only 95% accuracy with majority vote
– Dropped all workers with a quality score of 50% or below:
  – Filtering by error rate: only 1% of labels dropped
  – Filtering by quality scores: 30% of labels dropped, accuracy 99.8%

Note the massive amount of redundancy and the very conservative spam rejection.
Too much theory?

Open-source implementation available at:
http://code.google.com/p/get-another-label/

Input:
– Labels from Mechanical Turk
– Cost of incorrect labelings (e.g., X → G costlier than G → X)

Output:
– Corrected labels
– Worker error rates
– Ranking of workers according to their quality

Beta version, more improvements to come!
Suggestions and collaborations welcomed!
Workers reacting to quality scores

Score-based feedback leads to strange interactions.

The angry, has-been-burnt-too-many-times worker:
“F*** YOU! I am doing everything correctly and you know it! Stop trying to reject me with your stupid ‘scores’!”

The overachiever:
“What am I doing wrong?? My score is 92% and I want to have 100%”
An unexpected connection at the NAS “Frontiers of Science” conference

“Your spammers behave like my mice!”

“Eh?”

“Your spammers want to engage their brain only for motor skills and not for cognitive skills.”

“Yeah, makes sense…”

“And here is how I train my mice to behave…”

“I should try this the moment I get back to my room.”
Implicit Feedback with Frustration

Punish bad answers with frustration (e.g., delays between tasks):
– “Loading image, please wait…”
– “Image did not load, press here to reload”
– “Network error. Return the HIT and accept again”

Reward good answers by giving the next task immediately, with no problems.
→ Make this probabilistic to keep the feedback implicit (a toy sketch follows).
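A toy sketch of the idea (the thresholds and delay lengths are made up for illustration, not taken from the talk): decide probabilistically whether to show a fake glitch, based on whether the answer currently looks good.

```python
import random
import time

def respond(answer_looks_good: bool,
            p_frustrate_bad: float = 0.7,
            p_frustrate_good: float = 0.05) -> str:
    """Probabilistically 'frustrate' bad answers with fake delays/errors, so the
    feedback stays implicit: workers cannot tell scoring from ordinary glitches."""
    p = p_frustrate_good if answer_looks_good else p_frustrate_bad
    if random.random() < p:
        time.sleep(random.uniform(5, 30))              # "Loading image, please wait..."
        return "Image did not load, press here to reload"
    return "next task"                                 # good answers flow straight through
```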
First result

– Spammer workers quickly abandon
– Good workers keep labeling
– Bad: spammer bots are unaffected
– How do you frustrate a bot? Give it a CAPTCHA 
Second result (more impressive)

Remember, Don was training the mice…

– 15% of the spammers start submitting good work!
– Putting in cognitive effort is more beneficial…
– Key trick: learn to test workers on-the-fly
  – Model performance using a Dirichlet compound multinomial distribution (details offline)
Thanks!
Q & A?
Why does Model Uncertainty (MU) work?

[Figure: MU score distributions for correctly labeled (blue) and incorrectly labeled (purple) cases, overlaid on a scatter of positive and negative examples.]
Why does Model Uncertainty (MU) work?

[Figure: examples feed models and models flag suspect examples; the loop acts as a self-healing MU process, in contrast to “active learning” MU.]
What if different labelers have different qualities?

(Sometimes) the quality of multiple noisy labelers is better than the quality of the best labeler in the set.

[Figure: here, 3 labelers with qualities p-d, p, p+d.]
Soft Labeling vs. Majority Voting

– MV: majority voting
– ME: soft labeling

[Figure: accuracy vs. number of examples (bmg, p = 0.6), comparing MV and ME.]
Related topic

Estimating (and using) the labeler quality:
– for multilabeled data: Dawid & Skene 1979; Raykar et al. JMLR 2010; Donmez et al. KDD09
– for single-labeled data with variable-noise labelers: Donmez & Carbonell 2008; Dekel & Shamir 2009a,b
– to eliminate/down-weight poor labelers: Dekel & Shamir; Donmez et al.; Raykar et al. (implicitly)
– and to correct labeler biases: Ipeirotis et al. HCOMP-10

Example-conditional labeler performance:
– Yan et al. 2010a,b

Using the learned model to find bad labelers/labels:
– Brodley & Friedl 1999; Dekel & Shamir; us (I'll discuss)