Is data mining still a niche technology?

Mainlining Data Mining:
Jim Gray
Microsoft
Panel talk at ICDE2000
San Diego, 2 Mar 2000
Is data mining still a niche technology?
• 97,363 items on Northern Light re “data mining”
• 9,075,288 items re “data base” or “database”
• Is 100,000 items a niche? (OR: 14K, XML: 250K)
• Today, data mining tools are for experts (statisticians).
(Decision Trees, Clusters, K-means, Neural nets…)
• High tech and High Touch
aka: consulting and license fees
And the vendors like it that way.
• Claim that you MUST understand the technology
to use it.
But... The Petabytes are Coming!!
• We will be/are drowning in data/email/web..
• Abstraction & categorization are key technologies
• But,
– They have to work.
– They have to be trivial to learn.
• Successful Ubiquitous data mining
(clustering/classifiers…)
– Mail Filters/Classifiers
– Resume readers
– Shopping recommendations, Community finders
– Web search engines
Key technical/research issues for
transition to the mainstream?
PROCESS PROBLEMS:
• Getting data into tool is hell
• Scrubbing data is hell
• Then comes the easy part: mining
• Then comes the really hard part: visualization and understanding
• Most of us:
– Can’t understand neural nets (that’s bad).
– Can’t understand statistics (that’s a fact).
Key technical/research issues for
transition to the mainstream?
Opportunities: It’s not just numbers
• Text mining
• Time series
• Domain specific
– Web logs
– Protein patterns
– Spatial (e.g. geology, astronomy)
– Image
New opportunities for KDM?
• Make data capture/scrub/import trivial
• Provide intuitive manipulation interfaces
• Provide simpler analysis concepts
– support/confidence concept
– precision/recall
– ranking
– pivot & rollup & cube
• Provide interactive visual data explorer.
• Case in point:
I have yet to see a nice data cube visualizer.
[Figure: data cube diagram. Labels: Sum; By Make; By Year; By Color; By Make & Year; By Color & Year; By Make & Co; color values RED, WHITE, BLUE]
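The "pivot & rollup & cube" concept on this slide can be illustrated with a minimal sketch: a cube computes one aggregate for every subset of the dimensions (by make, by year, by color, each pair, all three, and the grand total). The sample rows, the `cube` function name, and the choice of a summed unit count are assumptions made for illustration; they do not come from the talk.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical sample rows: (make, year, color, units). Not from the talk.
SALES = [
    ("Chevy", 1994, "RED", 5),
    ("Chevy", 1994, "WHITE", 3),
    ("Chevy", 1995, "BLUE", 4),
    ("Ford", 1994, "RED", 2),
    ("Ford", 1995, "WHITE", 6),
]
DIMENSIONS = ("make", "year", "color")


def cube(rows, dimensions):
    """Sum the last column of each row for every subset of the dimensions.

    Returns {grouping: {key: total}}. A grouping is a tuple of dimension
    names; the empty grouping () holds the grand total under the key ().
    """
    index = {name: i for i, name in enumerate(dimensions)}
    totals = defaultdict(lambda: defaultdict(int))
    for size in range(len(dimensions) + 1):
        for grouping in combinations(dimensions, size):
            for row in rows:
                key = tuple(row[index[d]] for d in grouping)
                totals[grouping][key] += row[-1]
    return {g: dict(t) for g, t in totals.items()}


result = cube(SALES, DIMENSIONS)
print(result[()])                 # grand total ("Sum")
print(result[("make",)])          # "By Make"
print(result[("make", "year")])   # "By Make & Year"
```

This brute-force version rescans the rows once per grouping, which keeps the idea visible; a real system would share work between groupings.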
Research challenges that will
impact data mining?
• Simpler analysis concepts
• Visualization tools to navigate data
• Better algorithms = Better answers
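As one illustration of "simpler analysis concepts", the support/confidence and precision/recall measures named in the talk reduce to a few ratios. The sketch below uses the standard textbook definitions; the basket data and function names are hypothetical, chosen only for the example.

```python
# Hypothetical market baskets, used only to illustrate the definitions.
BASKETS = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]


def support(baskets, items):
    """Fraction of baskets that contain every item in `items`."""
    items = set(items)
    return sum(1 for b in baskets if items <= b) / len(baskets)


def confidence(baskets, antecedent, consequent):
    """Support of antecedent and consequent together, divided by the
    support of the antecedent alone (0.0 if the antecedent never occurs)."""
    base = support(baskets, antecedent)
    both = support(baskets, set(antecedent) | set(consequent))
    return both / base if base else 0.0


def precision_recall(retrieved, relevant):
    """Precision: share of retrieved items that are relevant.
    Recall: share of relevant items that were retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


print(support(BASKETS, {"bread", "milk"}))
print(confidence(BASKETS, {"bread"}, {"milk"}))
print(precision_recall([1, 2, 3, 4], [2, 4, 6]))
```

Each measure is a single ratio over counts, which is part of why such concepts are easier to explain to non-experts than a trained model.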