Data Mining
Status and Risks
Dr. Gregory Newby
UNC-Chapel Hill
http://ils.unc.edu/gbnewby
Overview
What are data mining and related concepts?
Fundamentals of the science and practice
of data mining
What data sources are available?
Causality and correlation
Risks of data mining
Future moves
Data Mining
“An information extraction activity whose goal is
to discover hidden facts contained in databases.
…[D]ata mining finds patterns and subtle
relationships in data and infers rules that allow
the prediction of future results. Typical
applications include market segmentation,
customer profiling, fraud detection, evaluation of
retail promotions, and credit risk analysis.”
(Via http://www.twocrows.com/glossary.htm)
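As a toy illustration of what "finding patterns and inferring rules" can look like in practice (a minimal Python sketch; the transaction records, item names, and support threshold are all invented for illustration, not taken from the talk):

```python
from collections import Counter
from itertools import combinations

# Hypothetical transaction records: items observed together in one record.
transactions = [
    {"passport", "one-way ticket", "cash payment"},
    {"passport", "one-way ticket"},
    {"round-trip ticket", "credit card"},
    {"passport", "cash payment"},
]

# Count how often each pair of items appears in the same record.
pair_counts = Counter()
for record in transactions:
    for pair in combinations(sorted(record), 2):
        pair_counts[pair] += 1

# "Infer a rule": flag pairs whose co-occurrence meets a support threshold.
MIN_SUPPORT = 2  # arbitrary threshold for this toy example
for (a, b), count in pair_counts.items():
    if count >= MIN_SUPPORT:
        print(f"{a} and {b} co-occur in {count} records")
```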
Data Mining
Is: Seeking new information from relations
among data, possibly from different
sources
Is: An important area of academic,
corporate and government research
Is: Important from a security standpoint,
because data mining might yield emergent
information that would otherwise remain
unknown
The Bigger Picture
Information retrieval
Data mining
Data fusion
The Data Universe
All data
All topics
All sources
Numeric, textual
Discrete, longitudinal
Lots and lots of data!
The data universe is growing constantly, and
many new data sources are being created as a
result of security concerns & technological
progress
Challenges of the Data Universe
Scale: too much data to deal with
Format: many different formats which are
difficult to merge or query
Access: most data (over 90%?) are not
Web-accessible
Databases
Proprietary or internal data
Formatting problems or issues
Solutions
Figure out how to get data from one format to another. Standards such as XML and EDI help (see the format-conversion sketch below)
Develop cooperative relationships among
data holders for data exchange. This is
happening much more in government
Develop tools to identify relationships
among data. This is the focus of data
mining
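As a small, hedged sketch of the format problem mentioned above, here is how a flat record (say, one CSV row exported by one agency) could be re-expressed as XML with Python's standard library; the field names and values are hypothetical:

```python
import csv
import io
import xml.etree.ElementTree as ET

# A hypothetical flat record, as one source might export it in CSV.
csv_text = "name,dob,city\nJane Doe,1970-01-01,Chapel Hill\n"
row = next(csv.DictReader(io.StringIO(csv_text)))

# Re-express the same record as XML so another system can ingest it.
person = ET.Element("person")
for field, value in row.items():
    ET.SubElement(person, field).text = value

print(ET.tostring(person, encoding="unicode"))
# <person><name>Jane Doe</name><dob>1970-01-01</dob><city>Chapel Hill</city></person>
```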
Data Mining != Web Searching
On the Web, we’re doing high precision
information retrieval
We want the first ranked documents to be
relevant
We don’t want to see irrelevant documents
The data universe for Web search engines is vast, so finding a few relevant documents to rank first is a relatively straightforward problem (though a big engineering challenge!); see the precision sketch below
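To make the precision idea concrete, here is a minimal sketch (the ranked results and relevance judgments are made up for illustration):

```python
def precision_at_k(ranked_results, relevant, k):
    """Fraction of the top-k ranked results that are relevant."""
    return sum(1 for doc in ranked_results[:k] if doc in relevant) / k

# Hypothetical ranked result list and relevance judgments.
results = ["d3", "d7", "d1", "d9", "d4"]
relevant = {"d3", "d1", "d8"}

print(precision_at_k(results, relevant, 3))  # 2 of the top 3 are relevant -> ~0.67
```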
Data Mining != Web Searching
Data mining is all about recall, not precision
Recall means we find all the relevant documents, regardless of how many irrelevant documents come along with them
This is a tougher problem, since the set of
responses to a given inquiry can be huge
It’s also tougher because of data formats, data merging, access, etc.
The data miner’s goal is to set a threshold above which relationships count as “interesting” (see the recall and threshold sketch below)
Data miners can also search for particular patterns, e.g. those related to an individual or group
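A matching sketch for recall and for the “interestingness threshold” idea; the documents, relationship scores, and cut-off are all invented:

```python
def recall(retrieved, relevant):
    """Fraction of all relevant documents that were actually retrieved."""
    return sum(1 for doc in relevant if doc in retrieved) / len(relevant)

retrieved = {"d3", "d7", "d1", "d9", "d4"}
relevant = {"d3", "d1", "d8"}
print(recall(retrieved, relevant))  # found 2 of the 3 relevant docs -> ~0.67

# The data miner's threshold: keep only relationships scored as "interesting".
relationship_scores = {("A", "B"): 0.91, ("A", "C"): 0.42, ("B", "C"): 0.77}
THRESHOLD = 0.75  # arbitrary cut-off for this sketch
interesting = {pair: s for pair, s in relationship_scores.items() if s >= THRESHOLD}
print(interesting)  # {('A', 'B'): 0.91, ('B', 'C'): 0.77}
```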
Today
Law enforcement, industry and government are
making their data sources more open to each
other (these data sources are not generally
publicly available)
Data integrity issues are a major concern
Data mining is still tough: “false positive” relationships are easy to find (see the sketch below)
Correlation vs. causality
Seek and ye shall find
Lots of data yields lots of matches
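One way to see why “lots of data yields lots of matches”: compare many purely random attributes and count how many pairs look strongly correlated by chance alone. This is a rough sketch with synthetic noise, not real data:

```python
import random

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

random.seed(0)
# 50 unrelated "attributes", each observed 30 times: pure noise.
variables = [[random.random() for _ in range(30)] for _ in range(50)]

# Check every pair; with 1,225 comparisons some will look correlated anyway.
spurious = 0
for i in range(len(variables)):
    for j in range(i + 1, len(variables)):
        if abs(pearson(variables[i], variables[j])) > 0.4:
            spurious += 1
print(f"{spurious} 'interesting' pairs found in pure noise")
```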
Today’s Data Sources
Credit and other financials
Law enforcement records
Travel history
Health data
Whatever you put on the Internet
If you are targeted:
Wiretap data (‘net, phone, etc.)
Surveillance data
HUMINT, etc., etc.
Tomorrow
Decreased barriers among different data
sources (this is a main impact of PATRIOT,
but more is coming)
Increased data collection (via PATRIOT
plus technological trends)
Better tools for data mining, and new
technologies making data sharing and
integration easier
Contact Info
Greg Newby is moving from UNC to UAF
New position:
Research Faculty at the Arctic Region Supercomputing Center
University of Alaska, Fairbanks
[email protected]