An Invitation to Data-Mining
Virgil -- [email protected]
GregR -- [email protected]
Interz0ne IV
March 12, 2005
Lecture Outline
• Introducing Data-Mining
• Google Hacking
• Intermission
• Examples of Using Data-Mining for:
  • Money
  • Power
  • Sex
• Closing
The Advent of Databases and the Internet
• Fact: The amount of data we have access
to is greater than ever before and is still
growing exponentially.
• If nothing else, the continued archival of
current data will quickly add up.
Continued Growth of the Internet
By all accounts this trend is only going to continue. Google alone
has already made phone books, maps, print catalogs, and even
full books all available digitally. Additionally, webcams, RFIDs,
and sensor networks are going online.
Growth of Digital Information
• A Practical Example…
• Back in the old days news of interesting websites propagated
through word of mouth.
• Then it moved to USENET groups (blogs are a modern
equivalent).
• But, then it became difficult to find the hottest newsgroups.
• To compensate for this we started using search engines.
• Today, we’re frequently using meta-search engines & meta-blogging sites like technorati.com, memestreams.net, and
del.icio.us.
• Data-Mining is an increasingly powerful tool for taking advantage of
the availability of huge amounts of digitized information.
What is Data-Mining…
• From Wikipedia...
Data mining has been defined as
• [1] "The nontrivial extraction of implicit, previously
unknown, and potentially useful information from data"
• [2] "The science of extracting useful information from large
data sets or databases".
• Like “Artificial Intelligence”, “Data-Mining” is a widely used
term with general connotations.
What is Data-Mining (contd.)
• Data-Mining is usually broken up into two
distinct steps.
• 1. “Data-Warehousing” – Collecting large
amounts of data
• 2. “Mining / Extraction” – Analysis (often
statistical) of the collected information.
Some Examples of Data Mining...
• Amazon.com’s Recommendation System
• MusicPlasma.com
• National Security Agency’s ECHELON
ECHELON is the largest electronic spy network in history,
run by the United States, the United Kingdom, Canada,
Australia, and New Zealand. It captures telephone calls,
faxes, e-mails, and IMs from around the world.
ECHELON is estimated to intercept about 3 billion
communications every day. (text-mining)
Other Users of Data Mining
• Nazis in France during WWII
• Mormons
• The Alexa/Google Toolbar
• Wal-Mart (e.g. the urban myth of a correlation between purchases of beer and diapers)
• RIAA/MPAA in P2P
• Microsoft in BitTorrent
• Rotten.com’s NNDB
• Basically, just about everyone is using data mining for all sorts of things.
Getting your feet wet in Data-Mining: Using Google
• Using Google is a great place to start data-mining.
• The data collection stage has already been
done for you!
• All you need to do is craft the perfect query
to find the interesting parts.
But what could you possibly find just using Google?
How About…
• Credit Card Numbers!
• More Credit Card Numbers!
• Plain-Text Passwords!
• Plain-Text passwords of Shoutcast Stations
• Apache Directory Listings
• Open Government FTP Sites
• Vulnerable Servers
• Brute Force’able Logins
• Voyeuristic WebCam Fun
• Exposed Config Settings
• Hashed Passwords
• And that’s just scratching the surface!
Intro: “Google Hacking”
• “Google Hacking” is the use of Google’s
data stores for naughty things.
• Makes extensive use of the advanced
Google syntaxes.
• Is trivially easy to do and is rather trendy.
• An excellent guide to get up to speed on
the techniques of “Google Hacking” is the
O'Reilly book Google Hacks by Tara
Calishain.
Google Hacking: Tools of the Trade
• On the surface, searching Google is
straightforward.
• But, there are many special parameters
(some of which are undocumented).
• You can use these parameters to exclude
everything but the data you're looking for.
Google Syntax Examples
• " "/-/+/( )      e.g. (interz0ne | outerz0ne) extraz0ne
• Site:            e.g. site:.mil
• Filetype:        e.g. filetype:.doc
• Related:         e.g. related:yak.net
• Link:
• [all]inanchor:   e.g. inanchor:"miserable failure"
• [all]inurl:      e.g. inurl:robots.txt
• [all]intext:
• [all]intitle:
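As a rough illustration of how these syntaxes combine, here is a minimal sketch that assembles a query string from plain terms and operator:value pairs. The function name build_query and its parameters are hypothetical (they are not part of any Google API); the sketch only builds the text you would type into the search box.

```python
def build_query(terms=(), any_of=(), exclude=(), **operators):
    """Compose a search string from plain terms and operator:value pairs.

    Hypothetical helper: it only formats text, it does not contact Google.
    """
    parts = []
    if any_of:
        # (a | b) matches pages containing either alternative
        parts.append("(" + " | ".join(any_of) + ")")
    for term in terms:
        # Quote multi-word phrases so they are matched exactly
        parts.append(f'"{term}"' if " " in term else term)
    for term in exclude:
        # A leading minus excludes pages containing the term
        parts.append("-" + term)
    for name, value in operators.items():
        # Operators such as site:, filetype:, inurl: take the form name:value
        value = f'"{value}"' if " " in value else value
        parts.append(f"{name}:{value}")
    return " ".join(parts)


print(build_query(any_of=["interz0ne", "outerz0ne"], terms=["extraz0ne"]))
print(build_query(inurl="robots.txt", site=".mil"))
print(build_query(inanchor="miserable failure"))
```

Keyword arguments keep their order, so the operators appear in the query in the order they were passed.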
Some Undocumented Syntaxes…
• '..' operator: Find between ranges of numbers
• '*' operator: Single word wild-card
• '~' operator: “Fuzzify”
• DATERANGE: Search only documents indexed within a particular timeframe.
Google Hacking: Further Reading
• Due to its ease, Google Hacking already has a
large following.
• Johnny Long runs a user-contributed "Google
Hacking Database" which contains over 1,000
ready-made search queries.
http://johnny.ihackstuff.com/
• Johnny Long also has a concise Google Hacking guide.
http://johnny.ihackstuff.com/security/premium/The_Google_Hackers_Guide_v1.0.pdf
Intermission
Questions on anything related or unrelated so
far?
Going Beyond Google
• “Google Hacking” is just the easy stuff.
• Data Mining techniques are applicable to virtually
everything.
• There is a large amount of interesting information
digitally available which is not indexed by Google
(or anyone else).
• To do more interesting things, you'll typically be
using one of these unindexed sources as your data set.
• All sorts of data is already out there; all you need
is the ingenuity to find applications for it.
Further Examples of Data Mining
Using Data-Mining to…
• Derive Mother's Maiden Names
• Uncover Corporate and Government
Secrets
• Embarrass minor celebrities
Deriving Mother's Maiden Names
• Mother’s Maiden Names (MMNs) are a
common security authenticator.
• Used as an authenticator for credit cards,
email accounts, websites, etc.
• Idea: You could mine public records
information from online databases to
automatically derive MMNs for random
people.
About our Study
• The most relevant records are the birth and
marriage records, both of which are “vital
records” in the public domain.
• At the very least, there will be some easy
cases for deriving MMNs (e.g. uncommon last
names, hyphenated last names, “Jr.”, “III”,
etc.)
• Although these techniques can be applied
anywhere, we focused on Texas.
Availability of Related Records
• Related public records are available at the
county, state, and national level.
• US Census records are held for 72 years before being released
• Searchsystems.net has a large listing of
county-level records
• Rootsweb provides full user-submitted
family trees
• We got most of our records from the Texas
Bureau of Vital Statistics’ website
Getting Texas Vital Records
• Collected marriage data from the State
Dept of Vital Statistics (records 1966-2002).
• However, the birth records were sealed in
2000, the death records in 2003.
• We found partial copies of the sealed
records on archive.org and full copies on
rootsweb.com and searchsystems.net.
• Furthermore, the death records were merely
unlinked from the site; you can still download death
info from their own servers 2 ½ years later.
Analyzing the Records
• Once we have a large corpus of both birth
and marriage data, we can apply whatever
heuristics we want in connecting children to
marriages.
• Lucky us! Birth records from 1950 and earlier include
the MMN in plaintext!
• This left us mostly state marriage records
from 1966-2002 and state birth records
from 1951-1995 to analyze.
Heuristics for Determining MMNs
1. Children will have the same last name as their
parents.
2. We do not have to link a child to a particular marriage
record, only to a particular maiden name. An attacker
doesn’t have to pick the correct parents, just the
correct MMN!
3. The parents' first and middle names are often
repeated within a child's first or middle name.
4. Children are often born in the same county in which
their parents were recently married.
5. Factor in Divorce Records [public domain]
6. Factor in SSDI / State Death Records [public domain]
Measuring our Success for Compromise
• Recall we need only match up to the correct MMN,
not the correct parents.
• After applying our heuristics we’ll have a list of
possible maiden names. We use data entropy
(Shannon entropy) to measure the ‘disorder’ of the
set of remaining MMNs.
• We then compare the entropy before and after the
application of the heuristics to measure the success
of our attack.
• Before any heuristics are applied, the entropy of the set of MMNs is ≈ 13 bits.
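To make the entropy measurement concrete, here is a minimal sketch of Shannon entropy computed over a list of candidate maiden names. The name lists are made up for illustration and are not data from the study; the function name shannon_entropy is likewise just a label chosen here.

```python
import math
from collections import Counter


def shannon_entropy(names):
    """Entropy in bits of the empirical distribution over candidate names."""
    counts = Counter(names)
    total = sum(counts.values())
    # H = sum over names of -p * log2(p), where p is the name's relative frequency
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical candidate sets (illustrative only, not real records):
# every maiden name still considered possible for one child.
before = ["Smith", "Jones", "Garcia", "Nguyen", "Lee", "Brown", "Davis", "Clark"]
after = ["Garcia", "Garcia", "Nguyen", "Garcia"]  # after some filtering heuristic

h_before = shannon_entropy(before)
h_after = shannon_entropy(after)
print(f"before: {h_before:.2f} bits, after: {h_after:.2f} bits")
```

When all remaining candidates are equally likely, k bits of entropy corresponds to a 1-in-2^k chance of guessing correctly, and 0 bits means only one candidate is left.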
Entropy Graph assuming only same last names
Results from just assuming same last name.
Entropy      # Marriages Compromised   % Marriages Compromised   Chance of Guessing MMN
= 0 bits     128,070                   2.07%                     1/1
<= 1 bit     207,069                   3.35%                     1/2
<= 2 bits    318,615                   5.16%                     1/4
<= 3 bits    463,228                   7.50%                     1/8
Questions?
(By the way, George Bush’s MMN is “Pierce”)
Data-Mining for .doc’s
• In case you weren't aware, the Microsoft .doc
format contains all sorts of interesting “metadata”
within the document.
• At times, this metadata has been known to be
intensely interesting.
• This metadata includes (among other things) the:
Title, Author, Date Created, Date Last Saved,
Editing Time, User’s Machine ID#, and
usernames of who made the last 10 revisions.
• This fact is known to some groups (such as
lawyers), but by and large people don't know about
it.
Past Incidents
• UK Prime Minister Tony Blair published a
dossier on the Iraq War.
• A Cambridge prof revealed that most of the
document was plagiarized from a grad
student in Monterey.
• Inspired by this, Richard Smith of
computerbytesman.com ran an analysis of the
dossier's .doc metadata.
• Smith uncovered a good deal more
incriminating evidence and made the Blair
government squirm. [Link]
That's a great idea! Let's do it better!
• Do massive crawling for all .doc’s on a
particular domain
• Extract all of their metadata
• Put it into a database with a web interface
• See if anything interesting turns up!
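The database step above might look roughly like the following sketch, which loads already-extracted metadata into SQLite and runs one query. Everything specific here is an assumption for illustration: the table name doc_meta, the column names, and the sample records are invented, and the crawling and extraction steps are not shown.

```python
import sqlite3

# Hypothetical records standing in for metadata already extracted
# from crawled .doc files (crawling and extraction are not shown).
records = [
    {"url": "http://example.org/a.doc", "author": "jdoe", "last_saved_by": "asmith"},
    {"url": "http://example.org/b.doc", "author": "jdoe", "last_saved_by": "jdoe"},
    {"url": "http://example.org/c.doc", "author": "mbrown", "last_saved_by": "asmith"},
]

conn = sqlite3.connect(":memory:")  # in-memory database, nothing touches disk
conn.execute(
    "CREATE TABLE doc_meta (url TEXT PRIMARY KEY, author TEXT, last_saved_by TEXT)"
)
# Named placeholders (:url, ...) are filled from each record's keys
conn.executemany(
    "INSERT INTO doc_meta VALUES (:url, :author, :last_saved_by)", records
)

# One example query: documents whose last editor differs from the listed author
rows = conn.execute(
    "SELECT url, author, last_saved_by FROM doc_meta "
    "WHERE author <> last_saved_by ORDER BY url"
).fetchall()
for row in rows:
    print(row)
```

A web interface would then sit on top of queries like this one; which queries turn out to be interesting is exactly the open question.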
What we've done (work in progress)
• No comprehensive Word metadata analysis system
exists.
• We’ve been weaving bits and pieces
together into an eventual whole.
• Demonstrations:
• [Demo of “The Revisionist” by Michal Zalewski]
• [Demo of Yak’ified “WordLeaker” by Madelman]
• [Demo of unreleased script
strings_against_references. (Works similarly to
Simon Byers’ work)]
.doc Mining -- Conclusions
• Okay, it's not finished yet.
• But not bad for starting this project last
week.
• The core concept works completely, but
needs a little more refinement.
• Better integration is needed, still a few bugs.
Last Example
Anyone recognize this person?
Cat Schwartz, TechTV eye candy
As one of her fans comments….
Cat Schwartz is one of the cute
girls on TechTV. I know
everybody jerks it to Morgan
Webb, but Cat has that nerdy
emo girl cuteness that I and
many others find hard to resist.
She has a blog on which she
does bloggy things like posting
pics of herself, writing crappy
poems, and keeping her fans
abreast of her schedule.
Cat Schwartz and her blog
Like all blog girls, she likes to post suggestive
images of herself on her blog. No one knows
why blog girls do this, but for now let us
simply accept that they do.
[www.catschwartz.com]
Suggestive Image #1
Suggestive Image #2
A little known fact…
• Programs like Photoshop store a full
thumbnail of the photo in the EXIF header
extension.
• Furthermore, if only a slight alteration is
made (e.g. cropping), Photoshop doesn’t
regenerate the thumbnail stored in the EXIF
header.
So....
Bringing them up to size…
And the net goes wild!
One enthusiastic fan comments…
“I SPANKED TWICE
IN A ROW TO
THESE!!! AND I'M GONNA SPANK
AGAIN!!! OMG! OMG! OMG! I
EVEN LICKED MY MONITOR!!!!!!!”
Doing This Even Better
• Crawl USENET for images
• Do math to determine if the image in the EXIF
thumbnail is different from the actual image
• Display the images
• Live Demo using a “Hot or Not” rating system
• Sadly, the results haven’t been that amazing,
most are just uninteresting croppings.
• But a few interesting bits….
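The "do math" step can be sketched as a mean absolute pixel difference between the embedded thumbnail and a downscaled copy of the main image. This is only one plausible metric, not necessarily the one used in the demo. The sketch assumes both images have already been decoded into equal-size grayscale grids (lists of rows of 0-255 values); the function names and the threshold of 10.0 are hypothetical.

```python
def mean_abs_diff(a, b):
    """Mean absolute difference between two equal-size grayscale grids (0-255)."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("grids must have the same dimensions")
    total = sum(abs(pa - pb) for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))
    count = sum(len(row) for row in a)
    return total / count


def thumbnail_differs(thumb, scaled_image, threshold=10.0):
    """Flag the pair when the average per-pixel difference exceeds the threshold."""
    return mean_abs_diff(thumb, scaled_image) > threshold


# Tiny made-up 2x2 grids for illustration
same = [[10, 20], [30, 40]]
noisy = [[12, 19], [29, 43]]     # small differences, e.g. from recompression
other = [[200, 200], [30, 40]]   # top row shows different content

print(thumbnail_differs(same, noisy))   # False: mean difference is 1.75
print(thumbnail_differs(same, other))   # True: mean difference is 92.5
```

A threshold is needed because recompression alone changes pixel values slightly; a simple crop would also need the images aligned before comparing, which this sketch does not attempt.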
Some Data Sets dying for interesting applications
• FEC Political Donation Data
http://ftp.fec.gov/FEC/presidential/
• GPS Coordinates of Zipcodes + TerraServer
http://www.census.gov/geo/www/tiger/zip1999.zip
• More Public Records // Sexual Offender Databases
http://www.searchsystems.net/
• Social Security Death Index
http://ssdi.genealogy.rootsweb.com/
• Library of Congress Print Catalog
http://www.loc.gov/rr/print/catalog.html
• Flickr.com
Ex: http://www.mappr.com
• P2P Network User Behavior
• Nanpa.com
End
V. Griffith, M. Jakobsson (2005); Messin' with Texas:
Deriving Mother’s Maiden Names Using Public Records
is available at: http://romanpoet.org/1/mmn.pdf
EXIF Data Mining References:
• Steven J. Murdoch: www.cl.cam.ac.uk/~sjm217
• Maximillian Dornseif: md.hudora.de