Potentials and Challenges


Data Mining:
Potentials and Challenges
Rakesh Agrawal & Jeff Ullman
Observations

Transfer of data mining research into deployed
applications and commercial products
– Greater success in vertical applications
– Horizontal tools. Examples:
  – SAS Enterprise Miner: Sophisticated Statisticians segment
  – DB2 Intelligent Miner: database applications requiring mining
Emergence of the application of data mining in
non-conventional domains
– Combination of structured and unstructured data

New challenges due to security/privacy concerns
DARPA initiative to fund data mining research
Identifying Social Links Using
Association Rules
Input: Crawl of about 1 million pages
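The slide gives only the input (a crawl of about 1 million pages), not the rule-mining details. As a minimal sketch, assuming each page has already been reduced to the set of person names it mentions (a hypothetical preprocessing step), pairwise association rules with the standard support and confidence measures could be computed like this. The function name and thresholds are illustrative, not taken from the talk.

```python
from collections import Counter
from itertools import combinations


def pair_rules(pages, min_support, min_confidence):
    """Return rules (x, y, support, confidence) meaning 'pages naming x also name y'.

    pages: iterable of sets of names (assumed input format).
    support: fraction of pages that mention both names.
    confidence: fraction of pages mentioning x that also mention y.
    """
    pages = [set(p) for p in pages]
    n = len(pages)
    name_count = Counter()
    pair_count = Counter()
    for names in pages:
        name_count.update(names)
        pair_count.update(combinations(sorted(names), 2))

    rules = []
    for (a, b), both in pair_count.items():
        support = both / n
        if support < min_support:
            continue
        # Check the rule in both directions, since confidence is asymmetric.
        for x, y in ((a, b), (b, a)):
            confidence = both / name_count[x]
            if confidence >= min_confidence:
                rules.append((x, y, support, confidence))
    return rules
```

A rule with high confidence in both directions suggests a strong link between the two people; a one-directional rule suggests that one name mostly appears in the context of the other.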
Website Profiling using
Classification
Input: Example pages for each category during training
Discovering Trends Using
Sequential Patterns & Shape Queries
[Figure: support (%) by time period, 1990 to 1994, for the phrases "heat removal", "emergency cooling", "zirconium based alloy", and "feed water"]
Input: i) patent database ii) shape of interest
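To make the two inputs concrete, here is a minimal sketch that computes a phrase's support per time period and checks it against a shape of interest. It assumes each patent is given as a (year, set of phrases) pair and that a shape is a list of "up"/"down"/"flat" steps between consecutive periods; both representations are simplifications chosen for illustration, not the query language used in the original work.

```python
from collections import defaultdict


def support_by_period(docs, phrase):
    """Percentage of documents in each period that contain the phrase."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for year, phrases in docs:
        totals[year] += 1
        if phrase in phrases:
            hits[year] += 1
    return {year: 100.0 * hits[year] / totals[year] for year in sorted(totals)}


def matches_shape(series, shape):
    """True if the period-to-period movement of the series equals the shape.

    series: dict mapping period -> support, in chronological order.
    shape: list of "up", "down", or "flat", one entry per consecutive pair.
    """
    values = list(series.values())
    steps = []
    for prev, cur in zip(values, values[1:]):
        steps.append("up" if cur > prev else "down" if cur < prev else "flat")
    return steps == list(shape)
```

A trend query then amounts to running `matches_shape` over the support series of every candidate phrase and keeping the phrases that match.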
Discovering Micro-communities
Japanese elementary schools
Turkish student associations
Oil spills off the coast of Japan
Australian fire brigades
Aviation/aircraft vendors
Guitar manufacturers
complete 3-3 bipartite graph
Frequently co-cited pages are related. Pages with large
bibliographic overlap are related.
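The two relatedness measures and the bipartite-core idea can be sketched as follows, assuming the link graph is a dict mapping each page to the set of pages it links to. The brute-force core search is only practical for small graphs and is meant to illustrate the definition of a complete 3-3 bipartite subgraph; it is not the algorithm used on the actual crawl.

```python
from itertools import combinations


def co_citation(links, a, b):
    """Number of pages that link to both a and b."""
    return sum(1 for targets in links.values() if a in targets and b in targets)


def bibliographic_overlap(links, a, b):
    """Number of pages that both a and b link to."""
    return len(links.get(a, set()) & links.get(b, set()))


def bipartite_cores(links, i=3, j=3):
    """Find complete i-j bipartite subgraphs by brute force.

    Returns (fans, centers) tuples where each of the i fan pages
    links to every one of the j center pages.
    """
    cores = []
    for fans in combinations(sorted(links), i):
        # Pages linked to by every fan in this group.
        common = set.intersection(*(links[f] for f in fans)) - set(fans)
        for centers in combinations(sorted(common), j):
            cores.append((fans, centers))
    return cores
```

Each core found this way is a candidate micro-community: the fans co-cite the centers, and the fans have a large bibliographic overlap with one another.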
New Challenges

Privacy-preserving data mining
Data mining over compartmentalized databases
Inducing Classifiers over Privacy
Preserved Numeric Data
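[Figure: each person's record (e.g. 30 | 25K | …, 50 | 40K | …) passes through a randomizer before it is collected (e.g. 65 | 50K | …, 35 | 60K | …). For example, Alice's age 30 becomes 65 (30+35). The age and salary distributions are reconstructed from the randomized values and passed to a decision tree algorithm, which produces the model. Diagram labels: Alice's age, Alice's salary, John's age.]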
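The randomize-then-reconstruct idea can be sketched as follows. The slide does not state the noise distribution or the reconstruction procedure, so this example assumes additive integer noise drawn uniformly from [-spread, +spread] and an iterative Bayes-style update over a discrete domain of candidate values. All names and parameters are illustrative.

```python
import random


def randomize(values, spread, rng):
    """Add independent uniform integer noise in [-spread, spread] to each value."""
    return [v + rng.randint(-spread, spread) for v in values]


def reconstruct(randomized, domain, spread, iterations=20):
    """Estimate the distribution of the original values over `domain`.

    Starts from a uniform guess and repeatedly averages, over all
    randomized observations, the posterior probability of each
    candidate original value given the current estimate.
    """
    domain = list(domain)
    noise_p = 1.0 / (2 * spread + 1)  # probability of any single noise value
    est = {a: 1.0 / len(domain) for a in domain}
    for _ in range(iterations):
        new = {a: 0.0 for a in domain}
        for w in randomized:
            # Only originals within `spread` of w could have produced w.
            weights = {a: est[a] * noise_p for a in domain if abs(w - a) <= spread}
            total = sum(weights.values())
            if total == 0:
                continue
            for a, wt in weights.items():
                new[a] += wt / total
        norm = sum(new.values())
        est = {a: new[a] / norm for a in domain}
    return est


rng = random.Random(0)
ages = [30] * 200 + [50] * 200
noisy_ages = randomize(ages, spread=20, rng=rng)
estimate = reconstruct(noisy_ages, domain=range(0, 101), spread=20)
```

Only the distribution is recovered, not any individual's value; a decision tree learner would then be trained using the reconstructed distributions rather than the raw records.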
Other recent work

Cryptographic approach to privacy-preserving data mining
– Lindell & Pinkas, Crypto 2000

Privacy-Preserving discovery of association
rules
– Vaidya & Clifton, KDD 2002
– Evfimievski et al., KDD 2002
– Rizvi & Haritsa, VLDB 2002
Computation over
Compartmentalized Databases
"Frequent Traveler" Rating Model
[Figure: approaches shown for building the model over separate databases: randomized data shipping; local computations followed by combination of partial models; on-demand secure data shipping and data composition. Data sources shown: Email, Phone, Demographic, Criminal Records, State, Birth, Marriage, Local, Credit Agencies.]
Some Hard Problems

Past may be a poor predictor of future
– Abrupt changes
– Wrong training examples

Actionable patterns (principled use of domain knowledge?)
Over-fitting vs. not missing the rare nuggets
Richer patterns
Simultaneous mining over multiple data types
When to use which algorithm?
Automatic, data-dependent selection of algorithm parameters
Discussion

Should data mining be viewed as “rich” querying
and “deeply” integrated with database systems?
– Most current work makes little use of database
functionality

Should analytics be an integral concern of
database systems?
Issues in data mining over heterogeneous data
repositories (relationship to the heterogeneous
systems discussion)
Summary

Data mining has shown promise but needs
much more research
We stand on the brink of great new answers, but even more, of
great new questions -- Matt Ridley