WikiScanner - VIRGIL.GRiffith


My summer of dilettante datamining
Virgil Griffith - Disruptive Technologist
California Institute of Technology
http://virgil.gr
Format of Presentation
WikiScanner
Amateur Data-mining is easy, fun, and high-impact.
You should try it!
When you edit Wikipedia…
You can edit leaving your username.
OR
You can edit anonymously, using your computer’s IP address instead of a username.
BUT
Sometimes IP addresses can be traced back (it’s tricky though).
Idea!
1) Download ALL of Wikipedia, getting all of the anonymous edits (free from Wikimedia).
   - Found 34.5M anonymous edits, ~21% of Wikipedia
2) Tracing is hard, so buy a database of which organizations own which IP addresses (available from private corporations, $1,000).
   - 2,668,095 different orgs in database.
3) Merge them together!
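The merge step can be sketched in a few lines of Python. This is a minimal illustration, not the actual WikiScanner code: the sample rows, the names `IP_RANGES`, `ANON_EDITS`, `build_index`, `lookup_org`, and `merge` are all hypothetical, the addresses are reserved documentation ranges, and the ownership ranges are assumed not to overlap.

```python
import bisect
import ipaddress

# Hypothetical sample of an IP-ownership database: (first_ip, last_ip, organization).
# Addresses are reserved documentation ranges; organization names are made up.
IP_RANGES = [
    ("192.0.2.0", "192.0.2.255", "Example Corp"),
    ("198.51.100.0", "198.51.100.127", "Example Agency"),
]

# Hypothetical anonymous edits: (page_title, editor_ip).
ANON_EDITS = [
    ("Example Corp", "192.0.2.17"),
    ("Some politician", "198.51.100.5"),
    ("Unrelated page", "203.0.113.9"),
]


def build_index(ranges):
    """Convert ranges to integers and sort by start address for binary search."""
    rows = sorted(
        (int(ipaddress.ip_address(first)), int(ipaddress.ip_address(last)), org)
        for first, last, org in ranges
    )
    starts = [row[0] for row in rows]
    return starts, rows


def lookup_org(ip, starts, rows):
    """Return the organization whose range contains `ip`, or None if none does.

    Assumes the ranges do not overlap.
    """
    value = int(ipaddress.ip_address(ip))
    pos = bisect.bisect_right(starts, value) - 1  # last range starting at or before ip
    if pos >= 0 and rows[pos][0] <= value <= rows[pos][1]:
        return rows[pos][2]
    return None


def merge(edits, ranges):
    """Attach an organization (or None) to every anonymous edit."""
    starts, rows = build_index(ranges)
    return [(page, ip, lookup_org(ip, starts, rows)) for page, ip in edits]


merged = merge(ANON_EDITS, IP_RANGES)
for page, ip, org in merged:
    print(page, ip, org)
```

Sorting the ranges once and using binary search keeps each lookup cheap, which matters when there are tens of millions of edits to label.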
What you can do now
Type an organization’s name and see all of the anonymous edits that came from their local network.
   - Found 187,529 different orgs with at least 1 edit.
See which organizations have edited a particular Wikipedia page.
Vote on the interesting stuff
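The two lookups above (by organization name, and by page) might look roughly like this once each edit is stored with the organization that owns its IP address. This is only a sketch: it uses Python's built-in `sqlite3` as a stand-in for a real database server, and the `edits` table, its columns, the sample rows, and the function names are assumptions made for illustration.

```python
import sqlite3

# In-memory database standing in for the real one; schema is hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE edits (page TEXT, ip TEXT, org TEXT)")
conn.executemany(
    "INSERT INTO edits VALUES (?, ?, ?)",
    [
        # Made-up rows using reserved documentation IP addresses.
        ("Example Corp", "192.0.2.17", "Example Corp"),
        ("Example Corp", "198.51.100.5", "Example Agency"),
        ("Some politician", "198.51.100.5", "Example Agency"),
    ],
)


def edits_by_org(conn, name):
    """Return (page, ip) for every edit whose organization name contains `name`."""
    cur = conn.execute(
        "SELECT page, ip FROM edits WHERE org LIKE ? ORDER BY page, ip",
        (f"%{name}%",),
    )
    return cur.fetchall()


def orgs_for_page(conn, page):
    """Return (org, edit_count) for every organization that edited `page`."""
    cur = conn.execute(
        "SELECT org, COUNT(*) FROM edits WHERE page = ? GROUP BY org ORDER BY org",
        (page,),
    )
    return cur.fetchall()


print(edits_by_org(conn, "Agency"))
print(orgs_for_page(conn, "Example Corp"))
```

Parameterized queries (the `?` placeholders) keep user-typed organization names from being interpreted as SQL.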
The Harvest
Different % of anonymous edits by country
Yes, the CIA does in fact edit Wikipedia [1] [2]
FOIA lawsuit filed over Mike Huckabee whitewashing. [1]
Dutch princess whitewashes connections to drug baron.
Politicians do in fact hire staff to police their pages. [1]
   - So do corporations. A lot.
Wikimedia wants WikiScanner 2.0
Shinier.
More bells and whistles.
New ways to catch people
Vote stacking
Time overlaps
Automatic vanity checks
Linkspam with Google ads
And more! (but I’m not telling)
Amateur Data-mining is fun
Data-Mining: How to
   - Connect data from multiple sources and repurpose it.
   - This means
1) Download data from different places
2) Use it for something other than what was intended.
Tools to get you started
MySQL / Python / Ruby
General Architecture for Text Engineering (GATE)
Text Similarity
Dutch data-mining tools
Interesting Sources of Data
Archive.org Wayback Machine
Notable Names Database (NNDB)
Domain Tools
Social Networks
Facebook, OkCupid, etc.
Any unusual dataset has useful purposes.
Using data for things you wouldn’t have guessed
Trust Metrics in Wikipedia text
How many writers made your page?
Google Maps and the impact of climate change
Real-time Wikipedia with Google Maps
Planet Sony
Visualizing Speeches
Repurposing WikiScanner…
Using WikiScanner for social science research
And we’re not even scratching the surface…
Kink by geography
Straight from OkCupid
Webpage diffs
Archive.org Wayback Machine
Analyze C-SPAN closed-caption data
TV Card + Text Analyzer
Large-scale mining of .doc metadata
Open source tools + Google-mining
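As one example from the list above, diffing two archived copies of a webpage needs nothing beyond Python's standard `difflib`. The sketch below assumes the two snapshots have already been downloaded and reduced to plain text; fetching them from an archive is left out. The snapshot contents and the names `page_diff` and `removed_lines` are made up for illustration.

```python
import difflib

# Hypothetical text of the same page captured at two different times.
old_snapshot = """Acme is a company.
It was founded in 1990.
It was fined in 2005.
"""
new_snapshot = """Acme is a company.
It was founded in 1990.
"""


def page_diff(old_text, new_text):
    """Return a unified diff between two snapshots as a list of lines."""
    return list(
        difflib.unified_diff(
            old_text.splitlines(),
            new_text.splitlines(),
            fromfile="older",
            tofile="newer",
            lineterm="",
        )
    )


def removed_lines(diff_lines):
    """Return lines present in the older snapshot but missing from the newer one."""
    body = diff_lines[2:]  # skip the two file-header lines ("--- older", "+++ newer")
    return [line[1:] for line in body if line.startswith("-")]


changes = page_diff(old_snapshot, new_snapshot)
removed = removed_lines(changes)
print("\n".join(changes))
print(removed)
```

Lines that quietly disappear between snapshots are often the interesting part, so the removed lines are pulled out separately.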
Future work on repurposing online data in interesting ways
Questions?