Data Mining:
Potentials and Challenges
Rakesh Agrawal & Jeff Ullman
Observations
Transfer of data mining research into deployed
applications and commercial products
– Greater success in vertical applications
– Horizontal tools, for example:
SAS Enterprise Miner: Sophisticated Statisticians segment
DB2 Intelligent Miner: database applications requiring mining
Emergence of the application of data mining in
non-conventional domains
– Combination of structured and unstructured data
New challenges due to security/privacy concerns
DARPA initiative to fund data mining research
Identifying Social Links Using
Association Rules
Input: Crawl of about 1 million pages
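The slide gives only the input, so the following is a minimal sketch of how pairwise association rules could surface links between people, assuming each crawled page has already been reduced to the set of names it mentions. The function name `pair_rules`, the thresholds, and the toy pages are hypothetical and are not taken from the talk.

```python
from collections import Counter
from itertools import combinations


def pair_rules(pages, min_support=0.5, min_confidence=0.6):
    """Return rules (x, y, support, confidence): 'x on a page implies y'."""
    n = len(pages)
    item_count = Counter()   # pages mentioning each name
    pair_count = Counter()   # pages mentioning both names of a pair
    for names in pages:
        unique = sorted(set(names))
        item_count.update(unique)
        pair_count.update(combinations(unique, 2))

    rules = []
    for (a, b), both in pair_count.items():
        support = both / n
        if support < min_support:
            continue
        # Each frequent pair can yield a rule in either direction.
        for x, y in ((a, b), (b, a)):
            confidence = both / item_count[x]
            if confidence >= min_confidence:
                rules.append((x, y, support, confidence))
    return sorted(rules)


# Hypothetical toy input: one set of names per page.
pages = [{"A", "B"}, {"A", "B", "C"}, {"A", "C"}, {"B"}]
rules = pair_rules(pages)
```

A rule such as ("C", "A") with high confidence suggests that pages mentioning C usually mention A as well, which is one possible signal of a social link.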
Website Profiling using
Classification
Input: Example pages for each category during training
Discovering Trends Using
Sequential Patterns & Shape Queries
[Figure: line plot of support (%), 0 to 4, over the time periods 1990 to 1994 for the phrases "heat removal", "emergency cooling", "zirconium based alloy", and "feed water".]
Input: i) patent database ii) shape of interest
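One way to read "shape of interest" is as a sequence of up/down/flat steps that a phrase's support history must contain. The sketch below illustrates that reading only; the step vocabulary, the `tolerance` parameter, and the support values are assumptions and do not come from the patent database shown in the talk.

```python
def transitions(series, tolerance=0.1):
    """Describe a support series as a list of 'up' / 'down' / 'flat' steps."""
    steps = []
    for prev, cur in zip(series, series[1:]):
        delta = cur - prev
        if delta > tolerance:
            steps.append("up")
        elif delta < -tolerance:
            steps.append("down")
        else:
            steps.append("flat")
    return steps


def matches_shape(series, shape, tolerance=0.1):
    """True if the requested step sequence occurs anywhere in the series."""
    steps = transitions(series, tolerance)
    k = len(shape)
    return any(steps[i:i + k] == list(shape) for i in range(len(steps) - k + 1))


# Hypothetical support (%) per time period for three made-up phrases.
support = {
    "phrase_x": [1.0, 1.5, 2.2, 3.0, 3.6],
    "phrase_y": [3.0, 2.5, 2.0, 2.4, 3.1],
    "phrase_z": [1.0, 1.0, 1.05, 1.0, 1.0],
}
query = ("down", "up")  # a trend that falls and then recovers
hits = [p for p, s in sorted(support.items()) if matches_shape(s, query)]
```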
Discovering Micro-communities
Japanese elementary schools
Turkish student associations
Oil spills off the coast of Japan
Australian fire brigades
Aviation/aircraft vendors
Guitar manufacturers
complete 3-3 bipartite graph
Frequently co-cited pages are related. Pages with large
bibliographic overlap are related.
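The two relatedness notions and the complete 3-3 bipartite graph can be sketched over a small link graph. This is a brute-force illustration, assuming the crawl is available as a mapping from each page to the set of pages it links to; the function names and the toy graph are hypothetical, and a real crawl would need a far more scalable method.

```python
from itertools import combinations


def cocitation_count(links, p, q):
    """Number of pages that link to both p and q."""
    return sum(1 for targets in links.values() if p in targets and q in targets)


def bibliographic_overlap(links, a, b):
    """Number of pages that both a and b link to."""
    return len(links[a] & links[b])


def bipartite_cores(links, i=3, j=3):
    """List (fans, centers): i pages that all link to the same j pages."""
    cores = []
    for fans in combinations(sorted(links), i):
        common = set.intersection(*(links[f] for f in fans)) - set(fans)
        for centers in combinations(sorted(common), j):
            cores.append((fans, centers))
    return cores


# Hypothetical toy link graph: page -> set of pages it links to.
links = {
    "f1": {"c1", "c2", "c3"},
    "f2": {"c1", "c2", "c3", "x"},
    "f3": {"c1", "c2", "c3"},
    "f4": {"x"},
}
cores = bipartite_cores(links)
```

Here the three pages f1, f2, f3 and the three pages they all cite form a complete 3-3 bipartite graph, which is the kind of signature the slide associates with a micro-community.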
New Challenges
Privacy-preserving data mining
Data mining over compartmentalized
databases
Inducing Classifiers over Privacy
Preserved Numeric Data
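[Figure: each person's record (fields labeled Alice's age, Alice's salary, John's age; example records 30 | 25K | … and 50 | 40K | …) passes through a Randomizer before it is collected, giving randomized records such as 65 | 50K | … and 35 | 60K | …; for example, an age of 30 becomes 65 (30+35). From the randomized data, the age distribution and the salary distribution are reconstructed and fed to a decision tree algorithm, which produces the model.]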
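The diagram shows values perturbed by adding a random amount and distributions reconstructed afterwards, but it does not spell out the reconstruction step. The sketch below is one possible illustration, assuming uniform additive noise and an iterative Bayes-style update over histogram bins; `randomize`, `reconstruct`, the bin edges, and the synthetic ages are all hypothetical.

```python
import random


def randomize(values, spread, rng):
    """Add uniform noise in [-spread, +spread] to every value."""
    return [v + rng.uniform(-spread, spread) for v in values]


def reconstruct(randomized, bins, spread, iterations=50):
    """Estimate the probability of each original-value bin from noisy data."""
    mids = [(lo + hi) / 2 for lo, hi in bins]
    est = [1.0 / len(mids)] * len(mids)  # start from a uniform guess

    def noise_density(d):
        return 1.0 / (2 * spread) if abs(d) <= spread else 0.0

    for _ in range(iterations):
        new = [0.0] * len(mids)
        for w in randomized:
            # Posterior weight of each bin given the noisy observation w.
            weights = [noise_density(w - m) * p for m, p in zip(mids, est)]
            total = sum(weights)
            if total == 0:
                continue  # w is not explainable by any bin midpoint
            for idx, wt in enumerate(weights):
                new[idx] += wt / total
        s = sum(new)
        if s == 0:
            return est
        est = [x / s for x in new]
    return est


rng = random.Random(0)
true_ages = [25] * 500 + [55] * 500          # synthetic, two age groups
noisy_ages = randomize(true_ages, spread=15, rng=rng)
age_bins = [(20, 30), (30, 40), (40, 50), (50, 60)]
estimate = reconstruct(noisy_ages, age_bins, spread=15)
```

Only `noisy_ages` would leave each person's hands in this setup; the estimated distribution, not the individual values, is what a downstream decision tree learner would consume.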
Other recent work
Cryptographic approach to privacy-preserving data mining
– Lindell & Pinkas, Crypto 2000
Privacy-Preserving discovery of association
rules
– Vaidya & Clifton, KDD 2002
– Evfimievski et al., KDD 2002
– Rizvi & Haritsa, VLDB 2002
Computation over
Compartmentalized Databases
"Frequent Traveler" Rating Model
Randomized data shipping
Local computations followed by combination of partial models
On-demand secure data shipping and data composition
Data sources: Email, Phone, Demographic, Criminal Records, State, Birth, Marriage, Local, Credit Agencies
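"Local computations followed by combination of partial models" can be illustrated with a count-based model, where each site shares only aggregate counts and never raw records. This sketch assumes every site holds records with the same (value, label) layout, which the slide does not state; the function names and toy records are hypothetical.

```python
from collections import Counter


def local_counts(records):
    """Run at one site: summarize (value, label) records as counts."""
    counts = Counter()
    for value, label in records:
        counts[(value, label)] += 1
    return counts


def combine(partials):
    """Run centrally: merge the partial models by summing their counts."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


def predict(model, value):
    """Return the label seen most often with this value, or None if unseen."""
    by_label = {label: c for (v, label), c in model.items() if v == value}
    return max(by_label, key=by_label.get) if by_label else None


# Hypothetical records held at two separate sites.
site_a = [("a", "pos"), ("a", "pos"), ("b", "neg")]
site_b = [("b", "neg"), ("a", "neg"), ("a", "pos")]
model = combine([local_counts(site_a), local_counts(site_b)])
```

Summing counts gives the same model as pooling the records would, which is why this family of models combines cleanly; models without that additive property need a different combination step.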
Some Hard Problems
Past may be a poor predictor of future
– Abrupt changes
– Wrong training examples
Actionable patterns (principled use of domain knowledge?)
Over-fitting vs. not missing the rare nuggets
Richer patterns
Simultaneous mining over multiple data types
When to use which algorithm?
Automatic, data-dependent selection of algorithm
parameters
Discussion
Should data mining be viewed as “rich” querying
and “deeply” integrated with database systems?
– Most current work makes little use of database
functionality
Should analytics be an integral concern of
database systems?
Issues in data mining over heterogeneous data
repositories (Relationship to the heterogeneous
systems discussion)
Summary
Data mining has shown promise but needs
much more research
We stand on the brink of great new answers, but even more, of
great new questions -- Matt Ridley