ppt - Panos Ipeirotis
Spam? No, thanks!
Panos Ipeirotis – New York University
ProPublica, Apr 1st 2010
(Disclaimer: No jokes included)
Panos Ipeirotis - Introduction
New York University, Stern School of Business
“A Computer Scientist in a Business School”
http://behind-the-enemy-lines.blogspot.com/
Email: [email protected]
Example: Build an Adult Web Site Classifier
Need a large number of hand-labeled sites
Get people to look at sites and classify them as:
G (general), PG (parental guidance), R (restricted), X (porn)
Cost/Speed Statistics
Undergrad intern: 200 websites/hr, cost: $15/hr
MTurk: 2500 websites/hr, cost: $12/hr
Bad news: Spammers!
Worker ATAMRO447HWJQ
labeled X (porn) sites as G (general audience)
Improve Data Quality through Repeated Labeling
Get multiple, redundant labels using multiple workers
Pick the correct label based on majority vote
1 worker: 70% correct
11 workers: 93% correct
Probability of correctness increases with number of workers
Probability of correctness increases with quality of workers
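The two claims above can be checked with a small calculation. This is a minimal sketch that assumes binary labels, independent workers, and a single shared accuracy p for every worker; none of these simplifications is stated on the slide, and the function name is made up for illustration.

```python
from math import comb

def majority_vote_accuracy(p, n):
    """Probability that the majority of n independent workers is correct.

    Assumes binary labels, an odd n (so no ties), and that every worker
    is correct with the same probability p.
    """
    need = n // 2 + 1  # smallest number of correct votes that wins
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(need, n + 1))

for n in (1, 3, 5, 11):
    print(n, round(majority_vote_accuracy(0.7, n), 3))
```

Under these assumptions, 11 workers at 70% accuracy give roughly 92%, close to the 93% quoted above.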
But Majority Voting is Expensive
Single Vote Statistics
MTurk: 2500 websites/hr, cost: $12/hr
Undergrad: 200 websites/hr, cost: $15/hr
11-vote Statistics
MTurk: 227 websites/hr, cost: $12/hr
Undergrad: 200 websites/hr, cost: $15/hr
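The 11-vote figure follows from dividing the single-vote throughput by the number of votes per site. A short sketch of that arithmetic, using only the numbers on the slide (the function names are illustrative):

```python
def effective_rate(sites_per_hour, votes_per_site):
    """Distinct sites labeled per hour when each site needs several votes."""
    return sites_per_hour / votes_per_site

def cost_per_site(cost_per_hour, sites_per_hour, votes_per_site=1):
    """Dollar cost of one fully labeled site."""
    return cost_per_hour * votes_per_site / sites_per_hour

print(int(effective_rate(2500, 11)))          # MTurk with 11 votes per site
print(round(cost_per_site(12, 2500, 11), 4))  # MTurk, 11 votes
print(round(cost_per_site(15, 200), 4))       # undergrad, single vote
```

With 11 votes per site, the MTurk speed advantage over the intern nearly disappears, which is the point of the slide.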
Using redundant votes, we can infer worker quality
Look at our spammer friend ATAMRO447HWJQ
together with 9 other workers
We can compute error rates for each worker
Error rates for ATAMRO447HWJQ
P[X → X]=9.847%   P[X → G]=90.153%
P[G → X]=0.053%   P[G → G]=99.947%
Our “friend” ATAMRO447HWJQ mainly marked sites as G.
Obviously a spammer…
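One simple way to get per-worker error rates out of redundant votes is to treat the majority label of each site as a stand-in for the truth and count how each worker's labels compare with it. The sketch below does a single pass of that idea; it is an assumption about how such rates could be estimated, not the method behind the numbers on the slide, and the worker and site names are hypothetical.

```python
from collections import Counter, defaultdict

def majority_labels(votes):
    """votes: {site: {worker: label}} -> {site: most common label}."""
    return {site: Counter(by_worker.values()).most_common(1)[0][0]
            for site, by_worker in votes.items()}

def worker_error_rates(votes):
    """Estimate P[true -> assigned] per worker, using majority vote as 'true'."""
    reference = majority_labels(votes)
    counts = defaultdict(Counter)  # counts[worker][(true, assigned)]
    totals = defaultdict(Counter)  # totals[worker][true]
    for site, by_worker in votes.items():
        for worker, assigned in by_worker.items():
            counts[worker][(reference[site], assigned)] += 1
            totals[worker][reference[site]] += 1
    return {worker: {pair: n / totals[worker][pair[0]] for pair, n in c.items()}
            for worker, c in counts.items()}

# Hypothetical toy data: two careful workers and one who always answers G.
votes = {
    "site1": {"w1": "X", "w2": "X", "spam": "G"},
    "site2": {"w1": "G", "w2": "G", "spam": "G"},
    "site3": {"w1": "X", "w2": "X", "spam": "G"},
}
rates = worker_error_rates(votes)
print(rates["spam"])
```

A worker who always answers G shows up with P[X → G] at or near 100% and P[G → G] at or near 100%, the same pattern as the worker above.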
Rejecting spammers and Benefits
Random answers error rate = 50%
Average error rate for ATAMRO447HWJQ: 45.2%
Action: REJECT and BLOCK
Results:
Over time you block all spammers
Spammers learn to avoid your HITs
You can decrease redundancy, as quality of workers is higher
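The reject decision above compares a worker's average error rate with what random answering would produce. A minimal sketch of such a rule follows; the `margin` threshold and the function names are assumptions for illustration, since the slides do not give a cutoff.

```python
def random_error_rate(n_classes):
    """Error rate of a worker who picks one of n_classes uniformly at random."""
    return 1.0 - 1.0 / n_classes

def looks_like_spammer(avg_error, n_classes, margin=0.10):
    """Flag a worker whose error rate is within `margin` of random guessing.

    `margin` is a hypothetical tuning knob, not a value from the slides.
    """
    return avg_error >= random_error_rate(n_classes) - margin

print(looks_like_spammer(0.452, n_classes=2))  # ATAMRO447HWJQ's average error
print(looks_like_spammer(0.05, n_classes=2))   # a careful worker
```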
After rejecting spammers, quality goes up
Spam keeps quality down
Without spam, workers are of higher quality
Need less redundancy for same quality
Same quality of results for lower cost
With spam
1 worker: 70% correct
11 workers: 93% correct
Without spam
1 worker: 80% correct
5 workers: 94% correct
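To see why higher worker quality means less redundancy, one can search for the smallest number of votes that reaches a target accuracy. The sketch assumes binary labels and independent workers with equal accuracy, which the slides do not state; the 92% target and the function names are chosen only for illustration.

```python
from math import comb

def majority_vote_accuracy(p, n):
    """P(majority of n independent binary votes is correct), n odd."""
    need = n // 2 + 1
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(need, n + 1))

def min_workers(p, target, max_n=101):
    """Smallest odd number of workers whose majority vote reaches `target`."""
    for n in range(1, max_n + 1, 2):
        if majority_vote_accuracy(p, n) >= target:
            return n
    return None  # target not reachable within max_n workers

print(min_workers(0.7, 0.92))  # worker pool with spam
print(min_workers(0.8, 0.92))  # worker pool without spam
```

Under these assumptions the pool at 70% accuracy needs 11 votes and the pool at 80% needs 5, in line with the figures above.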
Correcting biases
Classifying sites as G, PG, R, X
Sometimes workers are careful but biased
Error Rates for Worker: ATLJIK76YH1TF
P[G → G]=20.0%   P[G → P]=80.0%   P[G → R]=0.0%     P[G → X]=0.0%
P[P → G]=0.0%    P[P → P]=0.0%    P[P → R]=100.0%   P[P → X]=0.0%
P[R → G]=0.0%    P[R → P]=0.0%    P[R → R]=100.0%   P[R → X]=0.0%
P[X → G]=0.0%    P[X → P]=0.0%    P[X → R]=0.0%     P[X → X]=100.0%
Classifies G → P and P → R
Average error rate for ATLJIK76YH1TF: 45.0%
Is ATLJIK76YH1TF a spammer?
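The 45.0% figure can be reproduced by averaging the per-class error rates. The sketch assumes an unweighted mean over the four true classes, which is an assumption about how the slide's average was computed.

```python
# Rows are true classes, columns are assigned labels (values from the slide).
confusion = {
    "G": {"G": 0.2, "P": 0.8, "R": 0.0, "X": 0.0},
    "P": {"G": 0.0, "P": 0.0, "R": 1.0, "X": 0.0},
    "R": {"G": 0.0, "P": 0.0, "R": 1.0, "X": 0.0},
    "X": {"G": 0.0, "P": 0.0, "R": 0.0, "X": 1.0},
}

def average_error_rate(confusion):
    """Unweighted mean over true classes of P[wrong label | true class]."""
    per_class = [1.0 - row[true] for true, row in confusion.items()]
    return sum(per_class) / len(per_class)

print(round(average_error_rate(confusion), 3))
```

The per-class errors are 80%, 100%, 0% and 0%, so this worker looks almost as bad as a random one even though every mistake is a consistent one-step shift.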
For ATLJIK76YH1TF, we simply need to compute the “non-recoverable” error-rate (technical details omitted)
Non-recoverable error-rate for ATLJIK76YH1TF: 9%
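The slides omit the details, so the following is only one plausible reading of a "non-recoverable" error rate: turn each assigned label into a posterior over the true classes, and measure how uncertain that posterior still is. Uniform class priors, 0/1 misclassification costs, and the function name are all assumptions made here. With them, this worker scores 25% rather than the 9% on the slide, which presumably rests on different priors or costs.

```python
def expected_soft_label_cost(confusion, priors):
    """Expected 0/1 cost left after correcting a worker's labels for bias.

    confusion[true][assigned] = P[true -> assigned]; priors[true] = P(true).
    For each assigned label, compute the posterior over true classes and
    charge 1 - sum(q^2), the chance two draws from it disagree.
    """
    classes = list(priors)
    total = 0.0
    for assigned in classes:
        joint = {t: priors[t] * confusion[t][assigned] for t in classes}
        p_assigned = sum(joint.values())
        if p_assigned == 0:
            continue  # this worker never uses this label
        posterior = [v / p_assigned for v in joint.values()]
        total += p_assigned * (1.0 - sum(q * q for q in posterior))
    return total

uniform = {"G": 0.25, "P": 0.25, "R": 0.25, "X": 0.25}  # assumed priors
biased = {  # ATLJIK76YH1TF, values from the slide
    "G": {"G": 0.2, "P": 0.8, "R": 0.0, "X": 0.0},
    "P": {"G": 0.0, "P": 0.0, "R": 1.0, "X": 0.0},
    "R": {"G": 0.0, "P": 0.0, "R": 1.0, "X": 0.0},
    "X": {"G": 0.0, "P": 0.0, "R": 0.0, "X": 1.0},
}
always_g = {t: {"G": 1.0, "P": 0.0, "R": 0.0, "X": 0.0} for t in uniform}

print(expected_soft_label_cost(biased, uniform))
print(expected_soft_label_cost(always_g, uniform))
```

Under these assumptions the biased worker is far from a worker who always answers G: labels G, P and X from the biased worker pin down the true class exactly, and only R stays ambiguous between P and R.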
Too much theory?
Open source implementation available at:
http://code.google.com/p/get-another-label/
Input:
– Labels from Mechanical Turk
– Cost of incorrect labelings (e.g., X → G costlier than G → X)
Output:
– Corrected labels
– Worker error rates
– Ranking of workers according to their quality
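To illustrate why the cost of incorrect labelings is an input, here is a sketch of choosing a corrected label by minimum expected cost. It is not the get-another-label API; the function name, the posterior, and the cost values are hypothetical.

```python
def min_cost_label(posterior, cost):
    """Pick the label with the lowest expected misclassification cost.

    posterior[true] = estimated probability that `true` is the real class.
    cost[true][assigned] = cost of outputting `assigned` when `true` is real.
    """
    def expected_cost(label):
        return sum(p * cost[true][label] for true, p in posterior.items())
    return min(posterior, key=expected_cost)

posterior = {"G": 0.7, "X": 0.3}  # hypothetical estimate for one site
symmetric = {"G": {"G": 0, "X": 1}, "X": {"G": 1, "X": 0}}
asymmetric = {"G": {"G": 0, "X": 1}, "X": {"G": 5, "X": 0}}  # X → G costs 5x

print(min_cost_label(posterior, symmetric))
print(min_cost_label(posterior, asymmetric))
```

With symmetric costs the more probable label G wins; once X → G is made five times costlier, the same evidence yields X.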
Alpha version, more improvements to come!
Suggestions and collaborations welcomed!
Thank you!
Questions?
“A Computer Scientist in a Business School”
http://behind-the-enemy-lines.blogspot.com/
Email: [email protected]