WikiScanner - VIRGIL.GRiffith
My summer of dilettante datamining
Virgil Griffith - Disruptive Technologist
California Institute of Technology
[email protected]
http://virgil.gr
Format of Presentation
WikiScanner
Amateur Data-mining is easy, fun, and
high-impact.
You should try it!
When you edit Wikipedia…
You can edit leaving your username.
OR
You can edit anonymously, using your
computer’s IP address instead of a username.
BUT
Sometimes IP addresses can be traced back
(it’s tricky though).
Idea!
1) Download ALL of Wikipedia, getting all of the
anonymous edits (Free from Wikimedia.)
Found 34.5M anonymous edits, ~21% of Wikipedia
2) Tracing is hard, so buy a database of what
organizations own which IP addresses
(available from private corporations. $1,000)
2,668,095 different orgs in database.
3) Merge them together!
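The merge step above can be sketched roughly as follows. This is a minimal illustration, not the code WikiScanner actually ran: the table layouts, the names `OWNERS`, `EDITS`, `owner_of`, and `merge`, and the sample rows are all assumptions, and the IP ranges are reserved documentation blocks rather than real allocations. It also assumes the ownership ranges do not overlap.

```python
import bisect
import ipaddress

# Hypothetical IP-ownership table: (first address, last address, organization).
# These are reserved documentation ranges, not real allocations.
OWNERS = [
    ("192.0.2.0", "192.0.2.255", "Example Corp"),
    ("198.51.100.0", "198.51.100.255", "Example University"),
]

# Hypothetical anonymous edits: (page title, editor IP address).
EDITS = [
    ("Example Corp", "192.0.2.17"),
    ("Data mining", "198.51.100.4"),
    ("Data mining", "203.0.113.9"),
]


def to_int(address):
    """Convert a dotted IPv4 string to an integer so ranges can be compared."""
    return int(ipaddress.IPv4Address(address))


# Sort the ranges by their first address so we can binary-search them.
RANGES = sorted((to_int(first), to_int(last), org) for first, last, org in OWNERS)
STARTS = [first for first, _, _ in RANGES]


def owner_of(address):
    """Return the organization whose range contains the address, or None."""
    ip = to_int(address)
    # Find the last range that starts at or before this address.
    i = bisect.bisect_right(STARTS, ip) - 1
    if i >= 0 and RANGES[i][0] <= ip <= RANGES[i][1]:
        return RANGES[i][2]
    return None


def merge(edits):
    """Attach an owning organization (or None) to every anonymous edit."""
    return [(page, ip, owner_of(ip)) for page, ip in edits]


merged = merge(EDITS)
for row in merged:
    print(row)
```

Binary search over sorted range starts keeps each lookup cheap, which matters when there are tens of millions of edits and millions of organizations. A real ownership database would likely need extra handling for overlapping or nested ranges.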
What you can do now
Type an organization’s name and see all of
the anonymous edits that came from their
local network.
Found 187,529 different orgs with at least 1
edit.
See what organizations have edited a
particular Wikipedia page.
Vote on the interesting stuff
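Once the edits are merged with their owning organizations, the two lookups on this slide are simple queries. The sketch below uses Python's built-in `sqlite3` as a stand-in for MySQL. The `edits` table, its columns, the helper names `edits_by_org` and `orgs_for_page`, and the sample rows are assumptions for illustration, not the real WikiScanner schema.

```python
import sqlite3

# In-memory database standing in for the merged edits table (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE edits (page TEXT, ip TEXT, org TEXT)")
conn.executemany(
    "INSERT INTO edits VALUES (?, ?, ?)",
    [
        ("Example Corp", "192.0.2.17", "Example Corp"),
        ("Data mining", "192.0.2.40", "Example Corp"),
        ("Data mining", "198.51.100.4", "Example University"),
        ("Data mining", "198.51.100.5", "Example University"),
    ],
)


def edits_by_org(conn, org):
    """All anonymous edits that came from one organization's network."""
    query = "SELECT page, ip FROM edits WHERE org = ? ORDER BY page, ip"
    return conn.execute(query, (org,)).fetchall()


def orgs_for_page(conn, page):
    """Organizations that edited a page, most active first."""
    query = (
        "SELECT org, COUNT(*) FROM edits WHERE page = ? "
        "GROUP BY org ORDER BY COUNT(*) DESC, org"
    )
    return conn.execute(query, (page,)).fetchall()


print(edits_by_org(conn, "Example Corp"))
print(orgs_for_page(conn, "Data mining"))
```

Parameterized queries (the `?` placeholders) matter here because organization and page names come straight from user input.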
The Harvest
The percentage of anonymous edits differs by country
Yes, the CIA does in fact edit Wikipedia [1] [2]
FOIA lawsuit filed over Mike Huckabee whitewashing. [1]
Dutch princess whitewashes connections to
drug baron.
Politicians do in fact hire staff to police their
pages. [1]
So do corporations. A lot.
Wikimedia wants WikiScanner 2.0
Shinier.
More bells and whistles.
New ways to catch people
Vote stacking
Time overlaps
Automatic vanity checks
Linkspam with Google ads
And more! (but I’m not telling)
Amateur Data-mining is fun
Data-Mining: How to
Connect data from multiple sources and
repurpose it.
This means
1) Download data from different places
2) Use it for something other than what
was intended.
Tools to get you started
MySQL / Python / Ruby
General Architecture for Text Engineering
(GATE)
Text Similarity
Dutch data-mining tools
Interesting Sources of Data
Archive.org Wayback Machine
Notable Names Database (NNDB)
Domain Tools
Social Networks
Facebook, OkCupid, etc.
Any unusual dataset has useful
purposes.
Using data for things you wouldn’t
have guessed
Trust Metrics in Wikipedia text
How many writers made your page?
Google Maps and the impact of climate change
Real-time Wikipedia with Google Maps
Planet Sony
Visualizing Speeches
Repurposing WikiScanner…
Using WikiScanner for social science research
And we’re not even scratching the
surface…
Kink by geography
Straight from OkCupid
Webpage diffs
Archive.org Wayback Machine
Analyze C-SPAN closed-caption data
TV Card + Text Analyzer
Large-scale mining of .doc metadata
Open source tools + Google-mining
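As one example from the list above, the "webpage diffs" idea could start from something like the sketch below. It assumes you have already saved two snapshots of the same page as plain text (for instance, two captures from the Wayback Machine). Fetching them is left out, and the function name `page_diff` and the sample text are hypothetical.

```python
import difflib


def page_diff(old_text, new_text, old_label="snapshot A", new_label="snapshot B"):
    """Return unified-diff lines describing how a page changed between snapshots."""
    return list(
        difflib.unified_diff(
            old_text.splitlines(),
            new_text.splitlines(),
            fromfile=old_label,
            tofile=new_label,
            lineterm="",
        )
    )


# Made-up snapshot text for illustration only.
OLD = "Our product is safe.\nContact us for details."
NEW = "Our product is safe and effective.\nContact us for details."

changes = page_diff(OLD, NEW)
for line in changes:
    print(line)
```

Lines starting with `-` were removed and lines starting with `+` were added, so scanning for those prefixes gives a quick list of what a page's owner quietly changed.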
Future work on repurposing online
data in interesting ways
Questions?