NICAR-journalism
Crowdsourcing using Mechanical Turk
Quality Management and Scalability
Panos Ipeirotis – New York University
Example: Build an “Adult Web Site” Classifier
Need a large number of hand-labeled sites
Get people to look at sites and classify them as:
G (general audience) PG (parental guidance) R (restricted) X (porn)
Cost/Speed Statistics
Undergrad intern: 200 websites/hr, cost: $15/hr
MTurk: 2500 websites/hr, cost: $12/hr
Bad news: Spammers!
Worker ATAMRO447HWJQ
labeled X (porn) sites as G (general audience)
Improve Data Quality through Repeated Labeling
Get multiple, redundant labels using multiple workers
Pick the correct label based on majority vote
1 worker: 70% correct
11 workers: 93% correct
Probability of correctness increases with number of workers
Probability of correctness increases with quality of workers
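A minimal sketch of the majority-vote step in Python, assuming the redundant labels for each site have already been pulled out of the HIT results (the data layout and site IDs below are made up for illustration):

    from collections import Counter

    # Redundant labels from multiple workers for each site (illustrative data)
    labels_per_site = {
        "site-001": ["G", "G", "X", "G", "G"],
        "site-002": ["X", "G", "X", "X", "X"],
    }

    def majority_vote(labels):
        """Pick the label that most workers agreed on."""
        return Counter(labels).most_common(1)[0][0]

    for site, labels in labels_per_site.items():
        print(site, "->", majority_vote(labels))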
Using redundant votes, we can infer worker quality
Look at our spammer friend ATAMRO447HWJQ together with 9 other workers
We can compute error rates for each worker
Error rates for our “friend” ATAMRO447HWJQ:
P[X → X]=0.847%    P[X → G]=99.153%
P[G → X]=0.053%    P[G → G]=99.947%
He mainly marked sites as G. Obviously a spammer…
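A sketch of how such per-worker error rates can be estimated: take the majority-vote label as an approximation of the correct answer and count, per true class, how each worker labels those items. The vote triples and the two-class setup below are illustrative, not the actual data format:

    from collections import Counter, defaultdict

    # (worker_id, site_id, assigned_label) triples from the HIT results (illustrative)
    votes = [
        ("ATAMRO447HWJQ", "site-001", "G"),
        ("workerB",       "site-001", "X"),
        ("workerC",       "site-001", "X"),
        # ... many more votes ...
    ]

    # Majority-vote "truth" per site
    by_site = defaultdict(list)
    for worker, site, label in votes:
        by_site[site].append(label)
    truth = {s: Counter(ls).most_common(1)[0][0] for s, ls in by_site.items()}

    # Confusion counts per worker: how often true class t was labeled as a
    confusion = defaultdict(Counter)
    for worker, site, label in votes:
        confusion[worker][(truth[site], label)] += 1

    def error_rates(worker, classes=("G", "X")):
        """P[true -> assigned] for one worker, estimated from the counts."""
        rates = {}
        for t in classes:
            total = sum(confusion[worker][(t, a)] for a in classes)
            for a in classes:
                rates[(t, a)] = confusion[worker][(t, a)] / total if total else 0.0
        return rates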
Rejecting spammers and Benefits
Random answers error rate = 50%
Average error rate for ATAMRO447HWJQ: 49.6%
(P[X → X]=0.847%, P[X → G]=99.153%, P[G → X]=0.053%, P[G → G]=99.947%)
Action: REJECT and BLOCK
Results:
Over time you block all spammers
Spammers learn to avoid your HITs
You can decrease redundancy, as the quality of the remaining workers is higher
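A short follow-up sketch of the rejection rule: average the per-class error rates and flag workers whose average is close to the 50% a random answerer would get (the exact threshold below is an assumption, not from the slides):

    def average_error_rate(rates, classes=("G", "X")):
        """Mean probability of assigning a wrong label, averaged over true classes."""
        return sum(rates[(t, a)] for t in classes for a in classes if a != t) / len(classes)

    def looks_like_spammer(rates, threshold=0.45):  # illustrative threshold
        return average_error_rate(rates) >= threshold

    # ATAMRO447HWJQ: (0.99153 + 0.00053) / 2 ≈ 0.496 -> REJECT and BLOCK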
Too much theory?
Demo and Open source implementation available at:
http://qmturk.appspot.com
Input:
– Labels from Mechanical Turk
– Some “gold” data (optional)
– Cost of incorrect labelings (e.g., an X → G error is costlier than a G → X error); see the sketch after this list
Output:
– Corrected labels
– Worker error rates
– Ranking of workers according to their quality
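The slides do not show how the cost input is used, but one standard way to produce cost-sensitive corrected labels is to combine a posterior over classes (from the redundant votes) with the misclassification-cost matrix and pick the label with the lowest expected cost. The numbers below are made up:

    # cost[(true, assigned)]: letting porn through (X -> G) is assumed costlier
    # than being over-cautious (G -> X)
    cost = {("X", "G"): 10.0, ("G", "X"): 1.0, ("G", "G"): 0.0, ("X", "X"): 0.0}

    def corrected_label(posterior, classes=("G", "X")):
        """Pick the label that minimizes expected misclassification cost."""
        expected = {
            assigned: sum(posterior[t] * cost[(t, assigned)] for t in classes)
            for assigned in classes
        }
        return min(expected, key=expected.get)

    # Posterior from votes: 70% G, 30% X.
    # Expected cost of saying G is 0.3*10 = 3.0; of saying X it is 0.7*1 = 0.7.
    print(corrected_label({"G": 0.7, "X": 0.3}))   # -> "X"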
How to handle free-form answers?
Q: “My task does not have discrete answers….”
A: Break into two HITs:
– “Create” HIT (e.g. transcribe a caption)
– “Vote” HIT: is the result correct or not?
The Vote HIT controls the quality of the Creation HIT
Redundancy controls the quality of the Voting HIT
Catch: If the “creation” work is very good, voting workers just vote “yes”
– Solution: Add some random noise (e.g. misspell a word); see the sketch below
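A sketch of that noise trick: before sending a transcription to the voting HIT, occasionally corrupt one word, so that workers who blindly vote “yes” on everything can be caught. The corruption scheme and probability are just one simple possibility:

    import random

    def add_noise(text, p=0.2):
        """With probability p, swap two adjacent characters in a random word."""
        if random.random() >= p:
            return text, False            # unchanged: a "yes" vote is expected
        words = text.split()
        i = random.randrange(len(words))
        w = words[i]
        if len(w) > 1:
            j = random.randrange(len(w) - 1)
            words[i] = w[:j] + w[j + 1] + w[j] + w[j + 2:]
        return " ".join(words), True      # corrupted: a "yes" vote is suspicious

    noisy_text, was_corrupted = add_noise("A pocket calculator with some coins and a pen")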
Example: Collect URLs
But my free-form answer is not just right or wrong…
(e.g. “Describe this image”)
Break it into three HITs:
– “Create” HIT (e.g. describe the image)
– “Improve” HIT (e.g. improve the description)
– “Compare” HIT (voting): which description is better?
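A Python-flavoured sketch of that create → improve → compare loop (the real TurKit toolkit linked below is JavaScript; the three post_*_hit calls here are placeholders you would implement against the MTurk API):

    def iterative_description(image, rounds=8):
        """Create a description, then repeatedly improve it, keeping the better version."""
        text = post_create_hit(image)                  # placeholder: "describe this image"
        for _ in range(rounds):
            candidate = post_improve_hit(image, text)  # placeholder: "improve this description"
            winner = post_compare_hit(image, text, candidate)  # placeholder voting HIT
            if winner == "candidate":
                text = candidate
        return text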
TurKit toolkit: http://groups.csail.mit.edu/uid/turkit/
version 1:
A parial view of a pocket calculator together with
some coins and a pen.
version 2:
A view of personal items a calculator, and some gold and
copper coins, and a round tip pen, these are all pocket
and wallet sized item used for business, writting, calculating
prices or solving math problems and purchasing items.
version 3:
A close-up photograph of the following items: A CASIO multifunction calculator. A ball point pen, uncapped. Various coins,
apparently European, both copper and gold. Seems to be a
theme illustration for a brochure or document cover treating
finance, probably personal finance.
version 4:
…Various British coins; two of £1 value, three of 20p value and
one of 1p value. …
version 8:
“A close-up photograph of the following items: A
CASIO multi-function, solar powered scientific
calculator. A blue ball point pen with a blue rubber
grip and the tip extended. Six British coins; two of £1
value, three of 20p value and one of 1p value. Seems
to be a theme illustration for a brochure or document
cover treating finance - probably personal finance."
Future: Break big tasks into simple ones and build a workflow
Running experiment: Crowdsource big tasks (e.g., tourist guide)
My Boss is a Robot (mybossisarobot.com)
Nikki Kittur (Carnegie Mellon) + Jim Giles (New Scientist)
– Identify sights worth checking out (one tip per worker)
• Vote and rank
– Brief tips for each monument (one tip per worker)
• Vote and rank
– Aggregate tips in meaningful summary
• Iterate to improve…
Thank you!
Questions?
“A Computer Scientist in a Business School”
http://behind-the-enemy-lines.blogspot.com/
Email: [email protected]
Correcting biases
Classifying sites as G, PG, R, X
Sometimes workers are careful but biased
Error rates for the CEO of a company detecting offensive content (who is also a parent):

            → G        → P        → R        → X
from G     20.0%      80.0%       0.0%       0.0%
from P      0.0%       0.0%     100.0%       0.0%
from R      0.0%       0.0%     100.0%       0.0%
from X      0.0%       0.0%       0.0%     100.0%

She classifies G → P and P → R
Average error rate: too high. Is she a spammer?
Correcting biases
Error rates for Worker ATLJIK76YH1TF: the same confusion matrix as above.
For ATLJIK76YH1TF, we simply need to “reverse the errors” (technical details omitted) and separate error from bias.
True error rate ~ 9%
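A sketch of what “reversing the errors” can look like: if we know the worker's confusion matrix P[assigned | true], Bayes' rule turns a biased assigned label back into a distribution over the true label. The uniform prior below is an assumption for illustration:

    # This worker's confusion matrix P[assigned | true], from the table above
    P = {
        "G": {"G": 0.20, "P": 0.80, "R": 0.00, "X": 0.00},
        "P": {"G": 0.00, "P": 0.00, "R": 1.00, "X": 0.00},
        "R": {"G": 0.00, "P": 0.00, "R": 1.00, "X": 0.00},
        "X": {"G": 0.00, "P": 0.00, "R": 0.00, "X": 1.00},
    }

    def reverse_errors(assigned, prior):
        """P[true | assigned] via Bayes' rule, given the worker's error rates."""
        joint = {t: prior[t] * P[t][assigned] for t in prior}
        total = sum(joint.values())
        return {t: v / total for t, v in joint.items()} if total else dict(prior)

    # With a uniform prior, this worker saying "R" really means "P or R":
    print(reverse_errors("R", {c: 0.25 for c in "GPRX"}))   # P: 0.5, R: 0.5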
Scaling Crowdsourcing: Use Machine Learning
Human labor is expensive, even when paying cents
Need to scale crowdsourcing
Basic idea: Build a machine learning model and use it instead of humans
[Diagram: data from existing crowdsourced answers → automatic model (built through machine learning); each new case goes to the model, which returns an automatic answer]
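A minimal sketch of that basic idea, assuming scikit-learn is available and that the crowdsourced answers have already been turned into (page text, label) pairs; the toy pages below are made up:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # (page_text, label) pairs derived from existing crowdsourced answers (toy data)
    pages = ["family recipes and cooking tips", "explicit adult content and photos"]
    labels = ["G", "X"]

    model = make_pipeline(TfidfVectorizer(), LogisticRegression())
    model.fit(pages, labels)            # train once on the crowd-labeled data

    new_case = ["coupons and grocery deals"]
    print(model.predict(new_case))      # automatic answer instead of a human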
Tradeoffs for Automatic Models: Effect of Noise
Get more data → improve model accuracy
Improve data quality → improve classification
Example Case: Porn or not?
[Chart: classification accuracy (50%–100%) vs. number of training examples (Mushroom dataset), with one curve per training-data quality level: 100%, 80%, 60%, and 50%]
Scaling Crowdsourcing: Iterative training
Use machine when confident, humans otherwise
Retrain with new human input → improve model → reduce need for humans
[Diagram: data from existing crowdsourced answers → automatic model (built through machine learning); each new case goes to the model, which either returns an automatic answer or gets human(s) to answer]
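A sketch of that routing logic, continuing the scikit-learn example above: treat the model's predicted probability as the confidence signal and fall back to humans below an (assumed) threshold; ask_workers and training_pool are placeholders:

    def answer(model, page_text, training_pool, threshold=0.9):
        """Use the machine when confident, humans otherwise; keep new labels for retraining."""
        probs = model.predict_proba([page_text])[0]
        if probs.max() >= threshold:
            return model.classes_[probs.argmax()], "machine"
        label = ask_workers(page_text)              # placeholder: post a HIT, get a label
        training_pool.append((page_text, label))    # new human input, used to retrain later
        return label, "human"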
Scaling Crowdsourcing: Iterative training, with noise
Use machine when confident, humans otherwise
Ask as many humans as necessary to ensure quality
[Diagram: each new case goes to the automatic model (built through machine learning from existing crowdsourced answers); if the model is confident for quality, it returns an automatic answer; if not, get human(s) to answer and add their answers to the crowdsourced training data]
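Finally, a sketch of the “ask as many humans as necessary” variant: keep collecting redundant votes on a case until the majority is strong enough (the stopping rule and the cap of 11 workers are assumptions); ask_one_worker is a placeholder for posting a single assignment:

    from collections import Counter

    def ask_until_confident(page_text, target_agreement=0.8, max_workers=11):
        """Collect human labels one at a time until agreement is high enough."""
        votes = []
        while len(votes) < max_workers:
            votes.append(ask_one_worker(page_text))   # placeholder HIT call
            label, count = Counter(votes).most_common(1)[0]
            if len(votes) >= 3 and count / len(votes) >= target_agreement:
                break
        return label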