Data Mining: Crossing the Chasm

Rakesh Agrawal
IBM Almaden Research Center
Thesis
• The greatest challenge facing data mining is
to make the transition from being an early
market technology to mainstream
technology
• We have the opportunity to make this
transition successful
Outline
• Chasm in the technology adoption life
cycle, à la Geoffrey Moore†
• Experience with Quest/Intelligent Miner
• Ideas for successful chasm crossing
† Geoffrey A Moore. Crossing the Chasm. Harper Business.
http://www.chasmgroup.com
Technology Adoption Life Cycle
• Innovators (Techies): Try it!
• Early Adopters (Visionaries): Get ahead of the herd!
• Early Majority (Pragmatists): Stick with the herd!
• Late Majority (Conservatives): Hold on!
• Laggards (Skeptics): No way!
Psychographic profile of each group is different
Innovators: Technology Enthusiasts
• Intrigued by any fundamental advance in
technology
• Like to alpha test new products
• Can ignore the missing elements
• Want access to top technologists
• Want no-profit pricing (preferably free)
Gatekeepers to early adopters
Early Adopters: Visionaries
• Driven by vision of dramatic competitive
advantage via revolutionary breakthroughs
• Great imagination for strategic applications
• Not so price-sensitive
• Want rapid time to market
• Demand high degree of customization
Fund the development of early market
Early Majority: Pragmatists
• Want sustainable productivity improvement
through evolutionary change
• Astute managers of mission-critical apps
• Understand real-world issues and tradeoffs
• Focus on proven applications; want to see
the solution in production
Bulwark of the mainstream market
Late Majority: Conservatives
• Want to stay even with the competition
• Risk averse
• Price sensitive
• Need completely pre-assembled solutions
Extend technology life cycles
Laggards: Skeptics
• Driven to maintain status quo
• Good at debunking marketing hype
• Disbelieve productivity-improvement
arguments
• Can be formidable opposition to early
adoption of a technology
Retard the development of high-tech markets
Crack in the curve
[Figure: adoption curve with the chasm separating the Early Market from the Mainstream Market]
The greatest peril in the development of a high-tech market lies in
making the transition from an early market dominated by a few
visionaries to a mainstream market dominated by pragmatists.
Visionaries vs. Pragmatists
Visionaries:
• Adventurous
• First strike capability
• Early buy-in
• State of the art
• Think big
• Spend big
Pragmatists:
• Prudent
• Staying power
• Wait-and-see
• Industry standard
• Manage expectations
• Spend to budget
Is data mining following this curve?
• Yes!!!
• My personal viewpoint based on
Quest/Intelligent Miner experience
Quest
• Started as a skunkworks project in the early nineties
• Inspired by needs articulated by industry
visionaries:
– Transaction data collected over a long period
– Current tools/SQL don’t cut it
– About ready to throw the data away
Approach
• Examine “real” applications
• Identify operations that cut across
applications
• Design fast, scalable algorithms for each
operation
• Develop applications by composing
operations
Operations
• Associations
• Sequential Patterns
• Similar time series
• Classification
• Clustering
• Deviations
• New operations: completeness, scalability
• Adopted from Statistics/Learning: scalability
http://www.almaden.ibm.com/cs/quest
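
To make the "completeness" point concrete for the associations operation, here is a minimal sketch of a level-wise (Apriori-style) search that finds every itemset meeting a minimum support count. The function name, the toy baskets, and the threshold are illustrative assumptions, not Quest code, and the sketch makes no attempt at the scalability the slide asks for.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Return {itemset: support_count} for all itemsets with count >= min_support."""
    transactions = [frozenset(t) for t in transactions]
    # Level 1 candidates: every single item that appears anywhere.
    candidates = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while candidates:
        # Count how many transactions contain each candidate.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        # Join frequent k-itemsets into (k+1)-candidates, then prune any
        # candidate that has an infrequent k-subset.
        k += 1
        joined = {a | b for a in level for b in level if len(a | b) == k}
        candidates = {
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent


# Hypothetical toy data, for illustration only.
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]
result = frequent_itemsets(baskets, min_support=3)
print(result[frozenset({"diapers", "beer"})])  # 3
```

Because the search is exhaustive above the threshold, no qualifying itemset is missed; the pruning step only discards candidates that cannot qualify.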
Bringing Quest to market
• Visionaries who inspired Quest did not
become first customers:
– Wanted evidence that the technology “worked”
• Frustrating attempts to interest major IBM
customers:
– Integration with existing applications
– Too-far-out technology
– Resistance from in-house analytic groups
First hits
• Small information-based companies who
provided data in exchange for free results
• CIO who wanted to be seen as the
technology pioneer in his industry
• CIO who wanted the success story to
feature in the company’s annual report
Led to the formation of a group offering services using Quest
Characteristics of engagements
• Mostly associations and sequential patterns
• Completeness a big plus
• Unanticipated uses
• Feedback for further development
Into the product land
• Formation of a small “out-of-plan” product
group to productize Quest
• Facilitated by a closet mathematician
• Successes of the services group used for
market validation
• Continued development and infusion of
technology
Intelligent Miner
• Serious product
• Integrates technologies from various groups
• Fast, scalable, runs on multiple platforms
• Several “early market” success stories
http://www.software.ibm.com/data/iminer/
Are we in the chasm?
• Perceived to be sophisticated technology,
usable only by specialists
• Long, expensive projects
• Stand-alone, loosely-coupled with data
infrastructures
• Difficult to infuse into existing mission-critical applications
Chasm Crossing
• Personal speculations on some technical
challenges
• Do not imply IBM research/product
directions
XML-based Data Mining Standard (1)
[Diagram: Data Specs and Parameters (standard DTD) → Operator Library → Model (standard DTD)]
• Model Building:
– A pair of standard DTDs for each operation
– Interchangeable library of operator implementations
Ack: Mattos, Pirahesh, Schwenkries
XML-based Data Mining Standard (2)
[Diagram: Data Record, Model, and Mapping (standard DTDs) → Application Library → Result (standard DTD)]
• Model Deployment:
– Mapping XML object provides mapping between names and format in the model object and the data record
– Model could have been developed on a different system
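
The deployment idea can be sketched in a few lines: a model arrives as XML, a separate mapping document translates the local record's field names and value encodings into the model's vocabulary, and an application library applies the model. Every element and attribute name below (Model, Rule, Mapping, Field, Value) is invented for illustration; the slides do not define the DTDs.

```python
import xml.etree.ElementTree as ET

# Hypothetical model document, possibly built on a different system.
MODEL_XML = """<Model operation="classification">
  <Rule field="age_group" value="young" result="likely_buyer"/>
  <Rule field="age_group" value="senior" result="unlikely_buyer"/>
  <Default result="unknown"/>
</Model>"""

# Hypothetical mapping document: local record names/codes -> model names/values.
MAPPING_XML = """<Mapping>
  <Field record="AGEGRP" model="age_group">
    <Value record="1" model="young"/>
    <Value record="3" model="senior"/>
  </Field>
</Mapping>"""


def load_mapping(xml_text):
    """Parse the mapping into {record_field: (model_field, {record_value: model_value})}."""
    mapping = {}
    for field in ET.fromstring(xml_text).findall("Field"):
        values = {v.get("record"): v.get("model") for v in field.findall("Value")}
        mapping[field.get("record")] = (field.get("model"), values)
    return mapping


def apply_model(model_xml, mapping_xml, record):
    """Translate a data record into model vocabulary and return the first matching result."""
    mapping = load_mapping(mapping_xml)
    translated = {}
    for name, value in record.items():
        if name in mapping:
            model_name, values = mapping[name]
            # Unmapped values pass through unchanged.
            translated[model_name] = values.get(value, value)
    model = ET.fromstring(model_xml)
    for rule in model.findall("Rule"):
        if translated.get(rule.get("field")) == rule.get("value"):
            return rule.get("result")
    return model.find("Default").get("result")


print(apply_model(MODEL_XML, MAPPING_XML, {"AGEGRP": "1"}))  # likely_buyer
```

Keeping the mapping separate from the model is what lets the same model document be reused against data sources with different naming and encoding conventions.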
Implications
• Standard interfaces for application
developers to incorporate data mining
• Coupling with relational databases
– mappings from DTDs to relational schemas
– implementation using existing infrastructure
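
As one illustration of "implementation using existing infrastructure", the support counts behind two-item associations can be computed by the database itself with a self-join. This is a sketch under assumed names: the `sales(tid, item)` table, its rows, and the threshold are hypothetical, and each (tid, item) pair is assumed to appear at most once.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (tid INTEGER, item TEXT)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [
        (1, "bread"), (1, "milk"),
        (2, "bread"), (2, "beer"),
        (3, "bread"), (3, "milk"), (3, "beer"),
    ],
)

# Pair every two distinct items bought in the same transaction, count the
# transactions per pair, and keep pairs that reach the minimum support.
PAIR_SUPPORT = """
SELECT a.item, b.item, COUNT(*) AS support
FROM sales AS a
JOIN sales AS b ON a.tid = b.tid AND a.item < b.item
GROUP BY a.item, b.item
HAVING COUNT(*) >= ?
ORDER BY a.item, b.item
"""
pairs = conn.execute(PAIR_SUPPORT, (2,)).fetchall()
print(pairs)  # [('beer', 'bread', 2), ('bread', 'milk', 2)]
```

The data never leaves the database, so indexing, query optimization, and access control come along for free; whether this is fast enough for large inputs is a separate question that the sketch does not address.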
Data Mining Benchmarks
• UC Irvine repository
• Generating synthetic benchmarks modeled
after real data sets is a hard problem
– How to map names into meaningful literals
– How to preserve empirical distributions
Ack: Srikant, Ullman
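
A deliberately naive sketch shows why this is hard. The function below (a hypothetical helper, not an existing benchmark generator) replaces real literals with opaque codes and samples a synthetic column with the same empirical frequencies. It preserves only a single column's marginal distribution: the codes carry no meaning, and correlations between columns are lost, which are exactly the two open problems listed above.

```python
import random
from collections import Counter


def synthesize_column(real_values, n, rng):
    """Sample n synthetic values whose frequencies follow the real column's."""
    counts = Counter(real_values)
    # Replace each real literal with an opaque code so no real value leaks.
    codes = {value: f"v{i}" for i, value in enumerate(sorted(counts))}
    population = [codes[value] for value in counts]
    weights = [counts[value] for value in counts]
    return rng.choices(population, weights=weights, k=n), codes


rng = random.Random(0)
real_column = ["a"] * 80 + ["b"] * 20  # hypothetical real data
synthetic, codes = synthesize_column(real_column, 10_000, rng)
share_a = synthetic.count(codes["a"]) / len(synthetic)
print(round(share_a, 2))  # close to the real share of 0.8
```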
Auto-focus data mining
• Automatic parameter tuning
• Automatic algorithm selection (à la join
method selection in database query
optimization)
Ack: Andreas Arning
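
One simple way to approximate both bullets is to treat (algorithm, parameter) pairs as a search space and keep whichever scores best on held-out data. Note that this swaps in empirical trial for the cost-model estimation a query optimizer would use; the two toy "algorithms", the data, and all names below are assumptions for illustration.

```python
def majority_class(train, _param):
    """Baseline: always predict the most common training label."""
    labels = [y for _, y in train]
    winner = max(set(labels), key=labels.count)
    return lambda x: winner


def threshold_rule(_train, threshold):
    """Toy rule: predict 1 when x >= threshold (training data unused here)."""
    return lambda x: 1 if x >= threshold else 0


def accuracy(model, data):
    return sum(model(x) == y for x, y in data) / len(data)


def auto_focus(search_space, train, holdout):
    """Pick the (algorithm name, parameter) with the best held-out accuracy."""
    scored = [
        (accuracy(builder(train, param), holdout), name, param)
        for name, builder, param in search_space
    ]
    best_score, name, param = max(scored, key=lambda s: s[0])
    return name, param, best_score


train = [(1, 0), (2, 0), (3, 0), (6, 1), (7, 1)]
holdout = [(2, 0), (4, 0), (5, 1), (8, 1)]
space = [
    ("majority", majority_class, None),
    ("threshold", threshold_rule, 3),
    ("threshold", threshold_rule, 5),
    ("threshold", threshold_rule, 7),
]
print(auto_focus(space, train, holdout))  # ('threshold', 5, 1.0)
```

The user supplies data and a goal; the choice of algorithm and parameter value is made for them, which is the point of the auto-focus analogy.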
Web: Greatest opportunity
• Huge collection of data (e.g. Yahoo
collecting ~50GB every day)
• Universal digital distribution medium
makes data mining results actionable in
fundamentally new ways
• But watch for the privacy pitfall
Privacy-preserving data mining
• Technical vs. legislated solutions
• Implication for data mining algorithms
when some fields of a data record have been
fudged according to the user’s privacy
sensitivity
Ack: R. Srikant
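
A small sketch of what "fudged" fields might mean for an algorithm: if each user adds zero-mean random noise to a numeric field before releasing it, individual values are obscured, yet aggregates such as the mean can still be estimated from the released data. The uniform-noise scheme, the `spread` parameter, and the simulated ages are assumptions chosen for illustration, not a method stated in the slides.

```python
import random


def perturb(values, spread, rng):
    """Add uniform noise in [-spread, +spread] to each value before release."""
    return [v + rng.uniform(-spread, spread) for v in values]


rng = random.Random(42)
ages = [rng.randint(20, 70) for _ in range(20_000)]  # simulated true values
released = perturb(ages, spread=25, rng=rng)

true_mean = sum(ages) / len(ages)
estimated_mean = sum(released) / len(released)

# Individual released values can be far from the truth,
# but the noise averages out across many records.
print(abs(estimated_mean - true_mean) < 1.0)
```

A mining algorithm working on such data has to reason about distributions rather than trust any single record, which is the implication the slide points at.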
Personalization
• Internet might provide for the first time
tools necessary for users to capture
information about themselves and to
selectively release this information†
• Will we be providing these tools?
† John Hagel, Marc Singer. Net Worth. Harvard Business School Press.
What about Association Rules?
• Very long patterns
• Separating wheat from chaff
• Principled introduction of domain
knowledge
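
As an example of separating wheat from chaff, one common (though not the only) interestingness measure is lift: a rule's confidence divided by the baseline frequency of its consequent. The sketch below uses hypothetical baskets and assumes both sides of the rule occur at least once; the slides do not say which measures the speaker had in mind.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    n = len(transactions)
    n_ante = sum(1 for t in transactions if antecedent <= t)
    n_cons = sum(1 for t in transactions if consequent <= t)
    n_both = sum(1 for t in transactions if (antecedent | consequent) <= t)
    support = n_both / n
    confidence = n_both / n_ante          # P(consequent | antecedent)
    lift = confidence / (n_cons / n)      # > 1 means better than chance
    return support, confidence, lift


baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]
_, conf_beer, lift_beer = rule_metrics(baskets, {"diapers"}, {"beer"})
_, conf_milk, lift_milk = rule_metrics(baskets, {"bread"}, {"milk"})
print(conf_beer == conf_milk, lift_beer > 1 > lift_milk)  # True True
```

Both rules have the same confidence, but only the first beats what chance alone would predict, so a lift filter keeps it and drops the second.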
What else?
• Formal foundations of data mining
Summary
• Closely couple data
mining with database
systems
• Embed data mining
into applications
• Focus on web
• Standard interfaces
• Benchmarks
• Auto-focusing
• Personalization
• Privacy
Concluding remarks
• Data mining, a great technology
– Combination of intriguing theoretical questions
with large commercial interest in the
technology
• Poised for transitioning into mainstream
technology
• Will we rise to the challenge as a
community?
Acknowledgments
Arning, Arnold, Baune, Baur, Bayardo, Bollinger, Brodbeck, Carey, Chandra, Cody,
Faloutsos, Gardner, Gehrke, Ghosh, Greissl, Grove, Gruhl, Gunopulos, Gupta, Haas,
Ho, Imielinski, Iyer, Lent, Leyman, Lin, Lingenfelder, Mason, McPherson, Megiddo,
Mehta, Miranda, Psaila, Raghavan, Rissanen, Sarawagi, Sawhney, Schkolnick,
Schwenkries, Shafer, Shim, Somani, Srikant, Staub, Swami, Traiger, Vu, Zait