ppt - Panos Ipeirotis

Spam? No, thanks!
Panos Ipeirotis – New York University
ProPublica, Apr 1st 2010
(Disclaimer: No jokes included)
Panos Ipeirotis - Introduction
• New York University, Stern School of Business
“A Computer Scientist in a Business School”
http://behind-the-enemy-lines.blogspot.com/
Email: [email protected]
Example: Build an Adult Web Site Classifier
• Need a large number of hand-labeled sites
• Get people to look at sites and classify them as:
G (general), PG (parental guidance), R (restricted), X (porn)
Cost/Speed Statistics
• Undergrad intern: 200 websites/hr, cost: $15/hr
• MTurk: 2500 websites/hr, cost: $12/hr
Bad news: Spammers!
Worker ATAMRO447HWJQ labeled X (porn) sites as G (general audience)
Improve Data Quality through Repeated Labeling
• Get multiple, redundant labels using multiple workers
• Pick the correct label based on majority vote
1 worker: 70% correct
11 workers: 93% correct
• Probability of correctness increases with number of workers
• Probability of correctness increases with quality of workers
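The two trends above can be checked with a small calculation. The sketch below assumes a simplified model that the slides do not state: binary labels, an odd number of workers, and workers who err independently with the same accuracy. The function name is illustrative. Under these assumptions, workers who are each 70% accurate give a majority that is right about 92% of the time with 11 votes, in line with the roughly 93% shown on the slide.

```python
from math import comb


def majority_accuracy(p: float, n: int) -> float:
    """Probability that a strict majority of n workers is correct.

    Assumes binary labels, odd n, and independent workers who are
    each correct with probability p (a simplifying assumption).
    """
    needed = n // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(needed, n + 1)
    )


if __name__ == "__main__":
    # Accuracy grows with the number of workers and with worker quality.
    for p in (0.7, 0.8):
        for n in (1, 5, 11):
            print(f"p={p}, n={n}: {majority_accuracy(p, n):.3f}")
```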
But Majority Voting is Expensive
Single Vote Statistics
• MTurk: 2500 websites/hr, cost: $12/hr
• Undergrad: 200 websites/hr, cost: $15/hr
11-vote Statistics
• MTurk: 227 websites/hr, cost: $12/hr
• Undergrad: 200 websites/hr, cost: $15/hr
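The arithmetic behind these figures can be written out as follows. This is a minimal sketch with illustrative function names. It assumes each vote takes the same time as a single label, so 11 votes per site divide MTurk throughput by 11 while the hourly cost stays the same.

```python
def effective_throughput(labels_per_hour: float, votes: int) -> float:
    """Sites fully labeled per hour when each site needs `votes` labels."""
    return labels_per_hour / votes


def cost_per_site(labels_per_hour: float, dollars_per_hour: float, votes: int = 1) -> float:
    """Dollars spent per site when each site needs `votes` labels."""
    return dollars_per_hour / effective_throughput(labels_per_hour, votes)


if __name__ == "__main__":
    print("MTurk, 1 vote:    ", cost_per_site(2500, 12))
    print("MTurk, 11 votes:  ", cost_per_site(2500, 12, votes=11))
    print("Undergrad, 1 vote:", cost_per_site(200, 15))
    print("MTurk sites/hr with 11 votes:", round(effective_throughput(2500, 11)))
```

With 11 votes, the MTurk cost per site rises elevenfold and the throughput drops to about the level of the undergrad intern.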
Using redundant votes, we can infer worker quality
• Look at our spammer friend ATAMRO447HWJQ together with 9 other workers
• We can compute error rates for each worker
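One simple way to get per-worker error rates from redundant votes is sketched below. It uses the majority vote on each site as a stand-in for the true label and counts how each worker's labels compare with it. This is a simplified illustration, not necessarily the estimation method behind the numbers on these slides. The worker and site identifiers in the example data are hypothetical.

```python
from collections import Counter, defaultdict


def estimate_error_rates(labels):
    """Estimate P[true -> assigned] per worker.

    `labels` is a list of (worker_id, site_id, assigned_label) tuples.
    The majority label per site is used as a proxy for the true label.
    """
    # Step 1: majority vote per site.
    votes = defaultdict(Counter)
    for worker, site, label in labels:
        votes[site][label] += 1
    proxy_truth = {site: c.most_common(1)[0][0] for site, c in votes.items()}

    # Step 2: count (proxy true label -> assigned label) per worker.
    counts = defaultdict(lambda: defaultdict(Counter))
    for worker, site, label in labels:
        counts[worker][proxy_truth[site]][label] += 1

    # Step 3: normalize each row into probabilities.
    rates = {}
    for worker, rows in counts.items():
        rates[worker] = {
            true: {assigned: n / sum(row.values()) for assigned, n in row.items()}
            for true, row in rows.items()
        }
    return rates


# Hypothetical data: two careful workers and one who always answers G.
labels = [
    ("w1", "s1", "X"), ("w2", "s1", "X"), ("spam", "s1", "G"),
    ("w1", "s2", "G"), ("w2", "s2", "G"), ("spam", "s2", "G"),
    ("w1", "s3", "X"), ("w2", "s3", "X"), ("spam", "s3", "G"),
]
rates = estimate_error_rates(labels)

if __name__ == "__main__":
    for worker, rows in rates.items():
        print(worker, rows)
```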
Our “friend” ATAMRO447HWJQ mainly marked sites as G. Obviously a spammer…
Error rates for ATAMRO447HWJQ
P[G → G]=99.947%    P[G → X]=0.053%
P[X → G]=90.153%    P[X → X]=9.847%
Rejecting spammers and Benefits
Random answers error rate = 50%
Average error rate for ATAMRO447HWJQ: 45.2%
Action: REJECT and BLOCK
Results:
• Over time you block all spammers
• Spammers learn to avoid your HITs
• You can decrease redundancy, as quality of workers is higher
After rejecting spammers, quality goes up
Spam keeps quality down
Without spam, workers are of higher quality
Need less redundancy for same quality
Same quality of results for lower cost
With spam: 1 worker, 70% correct; 11 workers, 93% correct
Without spam: 1 worker, 80% correct; 5 workers, 94% correct
Correcting biases
• Classifying sites as G, PG, R, X
• Sometimes workers are careful but biased
Error Rates for Worker: ATLJIK76YH1TF
P[G → G]=20.0%    P[G → P]=80.0%    P[G → R]=0.0%      P[G → X]=0.0%
P[P → G]=0.0%     P[P → P]=0.0%     P[P → R]=100.0%    P[P → X]=0.0%
P[R → G]=0.0%     P[R → P]=0.0%     P[R → R]=100.0%    P[R → X]=0.0%
P[X → G]=0.0%     P[X → P]=0.0%     P[X → R]=0.0%      P[X → X]=100.0%
Classifies G → P and P → R
Average error rate for ATLJIK76YH1TF: 45.0%
Is ATLJIK76YH1TF a spammer?
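The 45.0% figure can be reproduced from the error-rate matrix. The sketch below assumes the average is an unweighted mean, over the four true classes, of the probability of assigning the wrong label. The slides do not state the formula, but this reading matches the number shown.

```python
# Rows are true classes, columns are assigned classes (P stands for PG).
confusion = {
    "G": {"G": 0.2, "P": 0.8, "R": 0.0, "X": 0.0},
    "P": {"G": 0.0, "P": 0.0, "R": 1.0, "X": 0.0},
    "R": {"G": 0.0, "P": 0.0, "R": 1.0, "X": 0.0},
    "X": {"G": 0.0, "P": 0.0, "R": 0.0, "X": 1.0},
}


def average_error_rate(confusion):
    """Unweighted mean over true classes of P[assigned != true]."""
    per_class_error = [1.0 - row[true] for true, row in confusion.items()]
    return sum(per_class_error) / len(per_class_error)


if __name__ == "__main__":
    print(f"{average_error_rate(confusion):.1%}")
```

By this measure the worker looks almost as bad as a spammer, even though every mistake follows a consistent pattern.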
Correcting biases
• For ATLJIK76YH1TF, we simply need to compute the “non-recoverable” error-rate (technical details omitted)
• Non-recoverable error-rate for ATLJIK76YH1TF: 9%
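The omitted details are not reconstructed here. As a rough illustration of why a consistently biased worker is still informative, the sketch below applies Bayes' rule to the worker's error-rate matrix to get a posterior over the true class given the assigned label. The uniform prior is an assumption, since the slides do not give class priors. The sketch does not attempt to reproduce the 9% figure.

```python
# Rows are true classes, columns are assigned classes (P stands for PG).
confusion = {
    "G": {"G": 0.2, "P": 0.8, "R": 0.0, "X": 0.0},
    "P": {"G": 0.0, "P": 0.0, "R": 1.0, "X": 0.0},
    "R": {"G": 0.0, "P": 0.0, "R": 1.0, "X": 0.0},
    "X": {"G": 0.0, "P": 0.0, "R": 0.0, "X": 1.0},
}


def posterior_true_class(confusion, assigned, prior=None):
    """P[true class | worker assigned `assigned`] via Bayes' rule."""
    classes = list(confusion)
    if prior is None:
        # Assumption: all classes equally likely a priori.
        prior = {c: 1.0 / len(classes) for c in classes}
    joint = {c: prior[c] * confusion[c].get(assigned, 0.0) for c in classes}
    total = sum(joint.values())
    if total == 0:
        raise ValueError(f"worker never assigns label {assigned!r}")
    return {c: v / total for c, v in joint.items()}


if __name__ == "__main__":
    for label in ("G", "P", "R", "X"):
        print(label, posterior_true_class(confusion, label))
```

Under these assumptions, a P from this worker always means the site is G, so that bias is fully recoverable. An R is ambiguous between P and R, and that ambiguity is where an unrecoverable error would come from.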
Too much theory?
Open source implementation available at:
http://code.google.com/p/get-another-label/
• Input:
– Labels from Mechanical Turk
– Cost of incorrect labelings (e.g., X→G costlier than G→X)
• Output:
– Corrected labels
– Worker error rates
– Ranking of workers according to their quality
• Alpha version, more improvements to come!
• Suggestions and collaborations welcomed!
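To show how asymmetric misclassification costs can change a corrected label, here is a minimal sketch that picks the label with the lowest expected cost given a probability distribution over the true class. It is not the tool's API. The function name, the cost values and the example probabilities are all hypothetical.

```python
def min_cost_label(posterior, cost):
    """Return the label with the lowest expected misclassification cost.

    `posterior[t]` is the probability that the true class is t.
    `cost[t][a]` is the cost of assigning label a when the truth is t.
    """
    candidates = list(posterior)

    def expected_cost(assigned):
        return sum(posterior[t] * cost[t][assigned] for t in posterior)

    return min(candidates, key=expected_cost)


# Hypothetical belief about one site after aggregating worker labels.
posterior = {"G": 0.6, "X": 0.4}

# Symmetric costs: every mistake costs 1.
symmetric = {"G": {"G": 0, "X": 1}, "X": {"G": 1, "X": 0}}

# Asymmetric costs: calling a porn site "general" (X -> G) costs 5.
asymmetric = {"G": {"G": 0, "X": 1}, "X": {"G": 5, "X": 0}}

if __name__ == "__main__":
    print("symmetric costs: ", min_cost_label(posterior, symmetric))
    print("asymmetric costs:", min_cost_label(posterior, asymmetric))
```

With symmetric costs the more likely class G wins. Once X→G is made costlier, the same evidence leads to the safer label X.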
Thank you!
Questions?
“A Computer Scientist in a Business School”
http://behind-the-enemy-lines.blogspot.com/
Email: [email protected]