Some Interesting Problems
Data Mining: Next 10 Years
Rakesh Agrawal
IBM Almaden Research Center
Foundations
What is data mining?
A collection of techniques?
A set of composable operations (à la relational algebra)?
Hints:
Inductive Databases (Mannila)
Relational Calculus + Statistical Quantifiers (Imielinski)
Privacy Implications
Can we build accurate data models while preserving the privacy of individual records?
Hints:
Randomization (Agrawal & Srikant): replace x by x + y, where y is drawn from a known distribution
Anonymization (Crypto literature)
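The randomization hint can be illustrated with a small sketch. It assumes Gaussian noise with mean 0 and a publicly known standard deviation, and it recovers only the mean and variance of the original values by subtracting the known noise variance. This is a simplified moment-based stand-in, not the distribution-reconstruction procedure of Agrawal & Srikant; the function names and the sample data are hypothetical.

```python
import random
import statistics


def randomize(values, noise_std, rng):
    """Replace each x by x + y, where y ~ Normal(0, noise_std) (assumed noise model)."""
    return [x + rng.gauss(0.0, noise_std) for x in values]


def estimate_original_moments(perturbed, noise_std):
    """Estimate mean and variance of the original values from the perturbed ones.

    The noise is independent of x, with known mean 0 and variance noise_std**2, so
    E[x] = E[x + y] and Var[x] = Var[x + y] - noise_std**2.
    """
    mean = statistics.fmean(perturbed)
    variance = max(statistics.pvariance(perturbed) - noise_std ** 2, 0.0)
    return mean, variance


# Hypothetical data: individual ages that should not be released as-is.
rng = random.Random(0)
ages = [rng.uniform(20, 60) for _ in range(20000)]

# Only the perturbed values leave the data owner.
released = randomize(ages, noise_std=15.0, rng=rng)

# The miner sees `released` and the noise parameters, never `ages`.
est_mean, est_var = estimate_original_moments(released, noise_std=15.0)
```

Individual released values say little about any one record, yet aggregate properties remain estimable because the noise distribution is known.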
Web Mining: Beyond Click Streams
Mining knowledge bases from the web
Completeness
Accuracy
Malicious Spam
Hints:
Brin’s Book experiment
etc. etc.
Web Mining: Beyond hrefs
What other social behaviors exist on the web, and how can we make use of them?
Hints:
Viral marketing paper in this conference
etc. etc.
Actionable Patterns
Principled use of domain knowledge for discarding uninteresting patterns
performance
Hints:
Papers in the recent KDD conferences
Simultaneous mining over multiple data types
Not just
Relational tables
Time series
Textual documents
But patterns across all of them
Some more problems
Online, incremental algorithms over data streams
When to retire the past data
Long sequential patterns
Discovering richer patterns (trees and dags)
Automatic, data-dependent selection of algorithm parameters
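For the stream items above, one possible way to frame "when to retire the past data" is to let old observations fade gradually and drop an item once its weight falls below a cutoff. The sketch below assumes exponential decay per arrival; the class name, the decay factor, and the threshold are hypothetical choices for illustration, not a method proposed in the talk.

```python
from collections import defaultdict


class DecayedCounter:
    """Incremental item counts in which older arrivals weigh less (illustrative only)."""

    def __init__(self, decay):
        if not 0.0 < decay <= 1.0:
            raise ValueError("decay must be in (0, 1]")
        self.decay = decay
        self.counts = defaultdict(float)

    def update(self, item):
        # Age every existing count by one step, then count the new arrival.
        for key in self.counts:
            self.counts[key] *= self.decay
        self.counts[item] += 1.0

    def retire(self, threshold):
        """Drop items whose decayed count fell below threshold; return their keys."""
        stale = [key for key, weight in self.counts.items() if weight < threshold]
        for key in stale:
            del self.counts[key]
        return stale


counter = DecayedCounter(decay=0.5)
for item in ["a", "a", "b", "b", "b", "b"]:
    counter.update(item)
retired = counter.retire(threshold=0.1)
```

Updating every count on each arrival costs time linear in the number of tracked items, which keeps the sketch short; a practical version would store timestamps and apply the decay lazily.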
What not to work on?
The field is too young!
Too early to say we don’t need new algorithms
Let every flower bloom!!!
Impressive results of the PVSM algorithm
Emphasize evaluation and benchmarks
Interesting research issues
Applications most likely to benefit from data mining
Web applications (I think)
Bioinformatics (I hope!)
Inhibitors
Insufficient skill base (Education)
Usability
Grand Challenge
Find
What’s there
What has changed
Across sovereign data repositories
“The true delight is in the finding out, rather than in the knowing.”
Isaac Asimov