Transcript: cpsr-datamining

Data Mining
Status and Risks
Dr. Gregory Newby
UNC-Chapel Hill
http://ils.unc.edu/gbnewby
Overview
- What is data mining and related concepts?
- Fundamentals of the science and practice of data mining
- What data sources are available?
- Causality and correlation
- Risks of data mining
- Future moves

Data Mining

“An information extraction activity whose goal is
to discover hidden facts contained in databases.
…[D]ata mining finds patterns and subtle
relationships in data and infers rules that allow
the prediction of future results. Typical
applications include market segmentation,
customer profiling, fraud detection, evaluation of
retail promotions, and credit risk analysis.”
(Via http://www.twocrows.com/glossary.htm)
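
As a hedged illustration of "infer[ring] rules that allow the prediction of future results," the sketch below fits a decision tree to a handful of invented transaction records and prints the learned rule; the data, feature names, and fraud labels are all made up for illustration, not drawn from the talk.

    # Minimal sketch (illustrative only): mining a rule from data, then using it to predict.
    # All records, features, and labels below are invented.
    from sklearn.tree import DecisionTreeClassifier, export_text

    # Each row: [amount, transactions in the last hour, foreign merchant (0/1)]
    X = [[25, 1, 0], [3000, 4, 1], [40, 2, 0], [5000, 6, 1], [15, 1, 0], [2500, 5, 1]]
    y = [0, 1, 0, 1, 0, 1]  # 1 = fraudulent, 0 = legitimate

    model = DecisionTreeClassifier(max_depth=2).fit(X, y)

    # The fitted tree is a human-readable rule "discovered" in the data...
    print(export_text(model, feature_names=["amount", "tx_last_hour", "foreign"]))

    # ...and applying it to a new record is the "prediction of future results."
    print(model.predict([[4200, 5, 1]]))
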
Data Mining
- Is: Seeking new information from relations among data, possibly from different sources
- Is: An important area of academic, corporate, and government research
- Is: Important from a security standpoint, because data mining might yield emergent information that would otherwise remain unknown

The Bigger Picture
- Information retrieval
- Data mining
- Data fusion
The Data Universe
- All data
- All topics
- All sources
- Numeric, textual
- Discrete, longitudinal
- Lots and lots of data!
The data universe is growing constantly, and many new data sources are being created as a result of security concerns & technological progress.
Challenges of the Data Universe
- Scale: too much data to deal with
- Format: many different formats which are difficult to merge or query
- Access: most data (over 90%?) are not Web-accessible
  - Databases
  - Proprietary or internal data
  - Formatting problems or issues

Solutions
- Figure out how to get data from one format to another. Standards such as XML and EDI help (see the sketch after this list).
- Develop cooperative relationships among data holders for data exchange. This is happening much more in government.
- Develop tools to identify relationships among data. This is the focus of data mining.
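
As a minimal sketch of the first point (moving data between formats via a standard like XML), the example below serializes a few invented records to XML with Python's standard library and parses them back; the field names and values are hypothetical.

    # Minimal sketch (illustrative only): exchanging records as XML.
    # Field names and values are invented.
    import xml.etree.ElementTree as ET

    records = [
        {"id": "1001", "name": "A. Example", "balance": "2500.00"},
        {"id": "1002", "name": "B. Example", "balance": "13.37"},
    ]

    root = ET.Element("records")
    for rec in records:
        node = ET.SubElement(root, "record", attrib={"id": rec["id"]})
        ET.SubElement(node, "name").text = rec["name"]
        ET.SubElement(node, "balance").text = rec["balance"]

    xml_doc = ET.tostring(root, encoding="unicode")
    print(xml_doc)

    # The receiving system, whatever its native format, can parse the same document:
    parsed = ET.fromstring(xml_doc)
    print([(r.get("id"), r.findtext("name")) for r in parsed.findall("record")])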

Data Mining != Web Searching

- On the Web, we're doing high-precision information retrieval (see the sketch after this list)
- We want the first-ranked documents to be relevant
- We don't want to see irrelevant documents
- The data universe for Web search engines is vast, so there are usually many relevant documents to choose from, making this a relatively straightforward problem (though a big engineering challenge!)
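
For concreteness, here is a small, hypothetical sketch of the precision-oriented view of Web search: given a ranked result list and a set of relevant documents (both invented), it computes precision at the top of the ranking.

    # Minimal sketch (illustrative only): precision at rank k for a ranked result list.
    def precision_at_k(ranked_ids, relevant_ids, k):
        """Fraction of the top-k ranked documents that are relevant."""
        top_k = ranked_ids[:k]
        hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
        return hits / k

    # Invented example data:
    ranked = ["d7", "d2", "d9", "d4", "d1", "d8"]   # the engine's ranking
    relevant = {"d2", "d5", "d7"}                    # documents a user would judge relevant

    print(precision_at_k(ranked, relevant, 3))       # 2 of the top 3 are relevant -> ~0.67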

Data Mining != Web Searching
- Data mining is all about recall, not precision
- Recall means we find all the relevant documents, regardless of how many irrelevant documents come with them
- This is a tougher problem, since the set of responses to a given inquiry can be huge
- It's also tougher because of data formats, data merging, access, etc.
- The data miner's goal is to set a threshold over which relationships are "interesting" (see the sketch after this list)
- Data miners can also search for particular patterns, e.g. those related to an individual or group
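
As a hedged sketch of "setting a threshold over which relationships are interesting," the example below computes pairwise correlations among a few invented attributes and reports only the pairs whose absolute correlation clears a chosen cutoff; the attributes, data, and 0.8 cutoff are all assumptions for illustration.

    # Minimal sketch (illustrative only): flag attribute pairs that cross an
    # "interestingness" threshold. All data and the threshold are invented.
    from itertools import combinations
    import numpy as np

    rng = np.random.default_rng(0)
    n = 200
    income = rng.normal(50_000, 10_000, n)
    attributes = {
        "income": income,
        "credit_limit": income * 0.3 + rng.normal(0, 2_000, n),  # related by construction
        "travel_miles": rng.normal(5_000, 2_000, n),              # unrelated
    }

    THRESHOLD = 0.8  # arbitrary cutoff for "interesting"
    for a, b in combinations(attributes, 2):
        r = np.corrcoef(attributes[a], attributes[b])[0, 1]
        if abs(r) > THRESHOLD:
            print(f"interesting: {a} ~ {b} (r = {r:.2f})")
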
Today
- Law enforcement, industry, and government are making their data sources more open to each other (these data sources are not generally publicly available)
- Data integrity issues are a major concern
- Data mining is still tough. "False positive" relationships are easy to come by (see the sketch after this list):
  - Correlation vs. causality
  - Seek and ye shall find
  - Lots of data yields lots of matches
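
To make "lots of data yields lots of matches" concrete, the sketch below generates purely random, unrelated columns and counts how many pairs still exceed a modest correlation threshold; the sizes and cutoff are invented, and the point is only that large collections of data produce spurious matches by chance.

    # Minimal sketch (illustrative only): spurious "matches" in purely random data.
    import numpy as np

    rng = np.random.default_rng(1)
    n_rows, n_cols = 50, 200                   # invented sizes: 200 unrelated attributes
    data = rng.normal(size=(n_rows, n_cols))

    THRESHOLD = 0.4                            # arbitrary "interestingness" cutoff
    corr = np.corrcoef(data, rowvar=False)     # all pairwise correlations
    upper = np.triu_indices(n_cols, k=1)       # each pair counted once
    spurious = int(np.sum(np.abs(corr[upper]) > THRESHOLD))

    # Every column is independent noise, yet some of the ~19,900 pairs
    # will clear the cutoff purely by chance.
    print(f"{spurious} 'interesting' pairs found among unrelated columns")
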
Today's Data Sources
- Credit and other financials
- Law enforcement records
- Travel history
- Health data
- Whatever you put on the Internet
- If you are targeted:
  - Wiretap data ('net, phone, etc.)
  - Surveillance data
  - HUMINT, etc., etc.
Tomorrow
- Decreased barriers among different data sources (this is a main impact of PATRIOT, but more is coming)
- Increased data collection (via PATRIOT plus technological trends)
- Better tools for data mining, and new technologies making data sharing and integration easier

Contact Info
- Greg Newby is moving from UNC to UAF
- New position: Research Faculty at the Arctic Region Supercomputing Center, University of Alaska, Fairbanks
- [email protected]