Some Interesting Problems



Rakesh Agrawal
IBM Almaden Research Center
Foundations

What is data mining?



A collection of techniques?
A set of composable operations (à la relational algebra)?
Hints:


Inductive Databases (Mannila)
Relational Calculus + Statistical Quantifiers (Imielinski)
Privacy Implications


Can we build accurate data models while preserving privacy of individual records?
Hints:


Randomization (Agrawal & Srikant): Replace x by x+y, where y is drawn from a known distribution
Anonymization (Crypto literature)
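
The randomization hint can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the synthetic "ages" data, the uniform noise distribution, and the names `randomize` and `NOISE_SCALE`. It recovers only the mean and variance of the original values from the released ones and is not the distribution-reconstruction procedure of Agrawal & Srikant.

```python
import random
import statistics

NOISE_SCALE = 25.0  # hypothetical half-width of the uniform noise interval


def randomize(values, noise_scale, rng):
    """Replace each x by x + y, with y drawn uniformly from [-noise_scale, noise_scale]."""
    return [x + rng.uniform(-noise_scale, noise_scale) for x in values]


rng = random.Random(0)

# Synthetic "private" records (assumed data, not from the talk).
ages = [rng.gauss(40, 10) for _ in range(10_000)]

# What the data miner gets to see: every record individually perturbed.
released = randomize(ages, NOISE_SCALE, rng)

# The noise has zero mean, so the mean of the released values
# estimates the mean of the original values.
true_mean = statistics.fmean(ages)
released_mean = statistics.fmean(released)

# The noise is independent of x, so Var(x + y) = Var(x) + Var(y).
# For uniform noise on [-a, a], Var(y) = a^2 / 3, which is known
# and can be subtracted out.
noise_variance = NOISE_SCALE ** 2 / 3
estimated_variance = statistics.pvariance(released) - noise_variance
true_variance = statistics.pvariance(ages)

print(f"mean: true={true_mean:.2f} released={released_mean:.2f}")
print(f"variance: true={true_variance:.2f} estimated={estimated_variance:.2f}")
```

Because the noise distribution is known, aggregate properties remain estimable even though no single released value reveals the original record exactly.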
Web Mining: Beyond Click Streams

Mining knowledge bases from the web




Completeness
Accuracy
Malicious Spam
Hints:


Brin’s Book experiment
etc. etc.
Web Mining: Beyond hrefs


What other social behaviors exist on the
web and how to make use of them?
Hints:


Viral marketing paper in this conference
etc. etc.
Actionable Patterns

Principled use of domain knowledge for:
discarding uninteresting patterns
performance
Hints:

Papers in the recent KDD conferences
Simultaneous mining over multiple data types

Not just




Relational tables
Time series
Textual documents
But patterns across all of them
Some more problems





Online, incremental algorithms over data streams
When to retire the past data
Long sequential patterns
Discovering richer patterns (trees and dags)
Automatic, data-dependent selection of algorithm parameters
What not to work on?

The field is too young!


Too early to say we don’t need new algorithms


Let every flower bloom!!!
Impressive results of the PVSM algorithm
Emphasize evaluation and benchmarks

Interesting research issues
Applications most likely to benefit from data mining


Web applications (I think)
Bioinformatics (I hope!)
Inhibitors


Insufficient skill base (Education)
Usability
The true delight is in the finding out, rather than in the knowing.
Isaac Asimov