Transcript Research

KDD-07 Invited Innovation Talk
August 12, 2007
Usama Fayyad, Ph.D.
Chief Data Officer & Executive VP
Yahoo! Inc.
Research
Thanks and Gratitude
• My family: my wife Kristina and my 4 kids; my parents and my sisters
• My academic roots: The University of Michigan, Ann Arbor – my Ph.D. committee,
including Ramasamy Uthurusamy (then at GM Research Labs), grad student colleagues (Jie
Cheng), internships at GM Research and at NASA’s JPL
• My Mentors and Collaborators
– Caltech Astronomy (G. Djorgovski, Nick Weir), Pietro Perona and M.C. Burl
– JPL/NASA Colleagues: Padhraic Smyth, Rich Doyle, Steve Chien, Paul Stolorz, Peter
Cheeseman, David Atkinson, many others…
– Microsoft Colleagues: Decision Theory Group, Surajit Chaudhuri, Jim Gray, Paul Bradley,
Bassel Ojjeh, Nick Besbeas, Heikki Mannila, Rick Rashid, many others
– Fellows in KDD: Gregory Piatetsky-Shapiro, Daryl Pregibon, Christos Faloutsos, Geoff
Webb, Bob Grossman, Jiawei Han, Eric Tsui, Tharam Dillon, Chengqi Zhang, many, many
colleagues
• My Business Partners
– Bassel Ojjeh, Nick Besbeas, many VCs, many advisers and strategic clients including
Microsoft SQL Server and sales teams
• My Yahoo! Colleagues:
– Zod Nazem, Jerry Yang, David Filo, Yahoo! exec team, Prabhakar Raghavan, Pavel
Berkhin, Nick Weir, Hunter Madsen, Nitin Sharma, Raghu Ramakrishnan, Y! Research
folks, many at Yahoo SDS and current and previous Yahoo! employees
A Data Miner’s Story –
Getting to Know the Grand Challenges
Personal Observations of a Data Mining Disciple
Usama Fayyad, Ph.D.
Chief Data Officer & Executive VP
Yahoo! Inc.
Overview
• The setting
• Why is data mining a must?
• Why is data mining not happening?
• A Data Miner’s Story
– Grand Challenges: Pragmatic
– Grand Challenges: Technical
– Some case studies
• Concluding Remarks
The data gap…
• The Machinery Moves on:
– Moore’s law: processing “capacity” doubles every 18 months: CPU,
cache, memory
– Its more aggressive cousin: disk storage “capacity” doubles every 9
months
• The Demand is exploding:
– Every business is an eBusiness
– Scientific Instruments and Moore’s law
– Government
• The Internet – the ubiquity of the Web
• The Talent Shortage
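A rough worked example of the gap, assuming the two doubling periods quoted on this slide hold steadily over the whole interval. The function name `growth_factor` and the three-year horizon are illustrative choices, not part of the talk.

```python
def growth_factor(years: float, doubling_months: float) -> float:
    """Multiplicative growth after `years` if capacity doubles every `doubling_months`."""
    return 2.0 ** (years * 12.0 / doubling_months)

years = 3                          # illustrative horizon (assumption)
cpu = growth_factor(years, 18)     # processing "capacity": doubles every 18 months
disk = growth_factor(years, 9)     # disk storage "capacity": doubles every 9 months
gap = disk / cpu                   # how much faster storage grows than processing

print(f"After {years} years: processing x{cpu:.0f}, storage x{disk:.0f}, gap x{gap:.0f}")
```

Under these assumptions, storage outgrows processing by another factor of two every 18 months, which is one way to read "the data gap".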
What is Data Mining?
Finding interesting structure in data
• Structure: refers to statistical patterns, predictive
models, hidden relationships
• Interesting: ?
• Examples of tasks addressed by Data Mining
– Predictive Modeling (classification, regression)
– Segmentation (Data Clustering)
– Affinity (Summarization)
• relations between fields, associations, visualization
Beyond Data Analysis
• Scaling analysis to large databases
– How to deal with data without having to move it out?
– Are there abstract primitive accesses to the data, in database
systems, that can provide mining algorithms with the
information to drive the search for patterns?
– How do we minimize--or sometimes even avoid--having to scan
the large database in its entirety?
• Automated search
– Enumerate and create numerous hypotheses
– Fast search
– Useful data reductions
• More emphasis on understandable models
– Finding patterns and models that are “interesting” or “novel” to
users.
• Scaling to high-dimensional data and models.
Data Mining and Databases
Many interesting analysis queries are difficult to state
precisely
• Examples:
– which records represent fraudulent transactions?
– which households are likely to prefer a Ford over a Toyota?
– Who’s a good credit risk in my customer DB?
• Yet database contains the information
– good/bad customer, profitability
– did/did not respond to mailout/survey/...
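One way to read this slide: a query such as "who is a good credit risk?" is hard to write down, but labeled outcomes already in the database let a rule be learned instead. The sketch below fits a single-threshold rule (a decision stump) to made-up records. The field `late_payments`, the data, and the function names are all hypothetical, and a stump is only the simplest possible stand-in for a predictive model.

```python
from typing import List, Tuple

# Hypothetical labeled records: (late_payments_last_year, was_good_customer)
records: List[Tuple[int, bool]] = [
    (0, True), (1, True), (0, True), (2, True),
    (4, False), (5, False), (3, False), (6, False),
]

def fit_stump(rows: List[Tuple[int, bool]]) -> Tuple[int, float]:
    """Find the threshold t that maximizes accuracy of the rule 'good if value <= t'."""
    best_t, best_acc = 0, 0.0
    for t in sorted({v for v, _ in rows}):
        correct = sum((v <= t) == label for v, label in rows)
        acc = correct / len(rows)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t, best_acc

threshold, accuracy = fit_stump(records)

def is_good_risk(late_payments: int) -> bool:
    """Apply the learned rule to a new customer."""
    return late_payments <= threshold

print(threshold, accuracy, is_good_risk(1), is_good_risk(5))
```

The analyst never states the query precisely; the labels (good/bad customer) do that work.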
Data Mining Grand Vision
ACME CORP ULTIMATE DATA MINING BROWSER
What’s New?
What’s Interesting?
Predict for me
The myths…
• Companies have built up some large and
impressive data warehouses
• Data mining is pervasive nowadays
• Large corporations know how to do it
• There are tools and applications that discover
valuable information in enterprise databases
The truths…
• Data is a shambles
– most data mining efforts end up not benefiting
from existing data infrastructure
• Corporations care a lot about data, and are
obsessed with customer behavior and
understanding it
• They talk a lot about it…
• An extremely small number of businesses are
successfully mining data
• The successful efforts are “one-offs”, “lucky
strikes”
Current state of Databases
→ Ancient Egypt
• Data navigation, exploration, & exploitation technology
is fairly primitive:
– we know how to build massive data stores
– we do not know how to exploit them
– we do the book-keeping really well (OLTP)
– Inadequate basic understanding of navigation /systems
• many large data stores are write-only (= data tomb)
A Data Miner’s Story
• Started out in pure research
– Professional student
– Math and algorithms
Researcher view
Database
Algorithms and Theory
Systems
Practitioner view
Database
Customer
Systems and integration
Algorithms
Business view
Customer
Database
Systems
$$$’s
Algorithms
A Data Miner’s Story
• Started out in pure research
• At NASA-JPL did basic research and applied
techniques to Science Data Analysis problems
– Worked with top scientists in several fields: astronomy,
planetary geology, atmospherics, space science, remote
sensing imagery
– Great results, strong group, lots of funding, high demand…
• So why move to Microsoft Research?
Example: Cataloging Sky Objects
Data Mining Based Solution
• 94% accuracy in recognizing sky objects
• Speed up catalog generation by one to two orders of
magnitude (unrealistic to perform manually).
• Classify objects that are at least one magnitude fainter than
catalogs to-date.
• Tripled the “data yield”
• Generate sky catalogs with much richer content:
– on order of billions of objects:
• > 2x10^7 galaxies
• > 2x10^8 stars
• 10^5 quasars
• Discovered new quasars 40 times more efficiently
A Data Miner’s Story
• Started out in pure research
• At NASA-JPL
• At Microsoft Research
– Basic research in algorithms and scalability
– Began to worry about building products and integrating
with database server
– Two groups established: research and product
• So why move out to a start-up?
Working with Large Databases
• One scan (or less) of the database
– terminate early if appropriate
• Work within confines of a given limited RAM
buffer
– Cluster a Gigabyte or Terabyte in, say 10 or 100
Megabytes RAM
• “Anytime” algorithm
– best answer always handy
• Pause/resume enabled, incremental
• Operate on forward-only cursor over a view
(essentially a data stream)
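The requirements above can be illustrated with sequential (online) k-means. This is a textbook technique swapped in for illustration, not the scalable clustering method developed at Microsoft Research. The class name `OnlineKMeans`, the synthetic cursor, and the choice of seeding with the first k rows are all assumptions.

```python
from typing import Iterator, List, Sequence

class OnlineKMeans:
    """Sequential k-means: state is k centroids plus counts, whatever the data size."""

    def __init__(self, k: int) -> None:
        self.k = k
        self.centroids: List[List[float]] = []
        self.counts: List[int] = []

    def nearest(self, point: Sequence[float]) -> int:
        """Index of the centroid closest to `point` (squared Euclidean distance)."""
        dists = [sum((x - c) ** 2 for x, c in zip(point, cen)) for cen in self.centroids]
        return dists.index(min(dists))

    def update(self, point: Sequence[float]) -> None:
        """Consume one row from the cursor; earlier rows are never revisited."""
        if len(self.centroids) < self.k:           # seed with the first k rows
            self.centroids.append(list(point))
            self.counts.append(1)
            return
        j = self.nearest(point)
        self.counts[j] += 1
        step = 1.0 / self.counts[j]                # running-mean update
        self.centroids[j] = [c + step * (x - c) for c, x in zip(self.centroids[j], point)]

def forward_only_cursor() -> Iterator[List[float]]:
    """Stand-in for a database cursor: yields each row once, in order."""
    for i in range(1000):
        base = 0.0 if i % 2 == 0 else 10.0
        yield [base + (i % 5) * 0.01]

model = OnlineKMeans(k=2)
for n, row in enumerate(forward_only_cursor(), start=1):
    model.update(row)
    if n == 500:
        snapshot = [c[:] for c in model.centroids]  # "anytime": usable mid-scan

print(model.centroids, model.counts)
```

Because the whole state is the centroid list and the counts, the scan can be paused and resumed by saving those two lists, and memory use does not depend on the number of rows. Seeding from the first k rows is fragile on real data; it is used here only to keep the sketch short.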
Business Results Gap
Business users are unable to apply the power of
existing data mining tools to achieve results
• Business Challenges: Acquisition, Conversion, Average Order, Retention, Loyalty
• Technical Tools / Technologies: Neural Networks, OLAP, Logistic Regressions, CART, Segmentation, Decision Trees, Genetic Algorithms, Bayesian Networks, CHAID
Business Results Gap
Business users are unable to apply the power of
existing data mining tools to achieve results
• Business Challenges: Acquisition, Conversion, Average Order, Retention, Loyalty
• Specialists: Statisticians, Data Mining PhDs, DBAs, Consultants
• Technical Tools / Technologies: Neural Networks, OLAP, Logistic Regressions, CART, Segmentation, Decision Trees, Genetic Algorithms, Bayesian Networks, CHAID
Evolving Data Mining
• Evolution on the technical front:
– New algorithms
– Embedded applications
– Make the analyst’s life easier
• Evolution on the usability front
– New metaphors
– Vertical applications embedding
– Used by the business user
• In both cases, success means invisibility…
Grand Challenges
• Pragmatic:
– Achieving integration and invisibility
• Research/Technical:
– Solving some serious unaddressed problems
Pragmatic Grand Challenge 1
Where is the data?
• There is a glut of stored data
• Very little of that data is ready for mining
• Data warehousing has proven that it will not
solve the problem for us
• Solution:
– integration with operational systems
– Take a serious database approach to solving the
storage management problem
digiMine Background
• Started as Venture Capital-funded company: digiMine, Inc. in March 2000.
• Built, operated and hosted data warehouses with built-in data mining apps
• Headquartered in Bellevue, Washington
• $45 million in funding – Mayfield, Mohr Davidow, American Express, Deutsche Bank
• Grew to over 120 employees
• 50+ patents in technology and processes
• Both technology and services
Sample Customers
A Data Miner’s Story
• Started out in pure research
• At NASA-JPL
• At Microsoft Research
• At digiMine
– Lots of VC funding, great team, great press coverage,
and fast moving
– great customers
• So why move to DMX Group?
Why DMX Group?
• At digiMine, we grew a large “Professional Services”
organization
• We learned a lot from these engagements
• VC-funded companies cannot do much consulting
• A fork in the road appeared…
– digiMine re-focused on a market vertical: behavioral
targeting for media and publishers
– Renamed to Revenue Science, Inc.
• Formed DMX Group… which was eventually acquired by
Yahoo!
DMX Group Mission
• Make enterprise data a working asset in the
enterprise:
– Data strategy for the business
– Implementation of Business Intelligence and data
mining capabilities
– Business issues around data
• What is possible?
• How to expose it to business users
• How to train people and change processes
– Integration with operational systems
Data Strategy
• How can your data influence your revenues?
• How do you optimize operations based on data?
• How do you increase customer retention based on
data?
• How do you utilize enterprise data assets to spot
new opportunities:
– Cross-sell to existing customers
– Grow new markets
– Avoid problems such as fraud, abuse, churn, etc?
A Data Miner’s Story
• Started out in pure research
• At NASA-JPL
• At Microsoft Research
• At digiMine/Revenue Science Inc.
• At DMX Group…
Pragmatic Grand Challenge 2
Embedding within Operational Systems
• We all worry about algorithms; they are fascinating
• Most of us know that data mining in practice is mostly data prep
work
• Go where the data is when the data does not come to you
• But how much of the problem is “data mining”?
• Facts:
– The effort in embedding an application is huge, and often not
discussed
– Without it, all the algorithms are useless
Case Study – Wireless Telco
Churn Modelling and Prediction
Modeling Process
Data sources: SMS, WAP, CDR, Billing
1. Customer Interaction Base
2. Sample Database
3. Build Churn Model
4. Score Database
5. Assign Customer Value
6. Segment by Risk (High Risk, Med Risk, Low Risk) and by Value (High Val, Med Val, Low Val)
Value × Risk segments:
High Val / High Risk | High Val / Med Risk | High Val / Low Risk
Med Val / High Risk | Med Val / Med Risk | Med Val / Low Risk
Low Val / High Risk | Low Val / Med Risk | Low Val / Low Risk
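The final step of the modeling process, crossing risk bands with value bands, can be sketched as follows. The slide does not say how the High/Med/Low cut-offs are chosen, so the equal-thirds split here is an assumption. The customer IDs, scores, and the helper `tertile_labels` are hypothetical.

```python
from typing import Dict, Tuple

def tertile_labels(scores: Dict[str, float], suffix: str) -> Dict[str, str]:
    """Rank customers by score and label the top, middle and bottom thirds."""
    ranked = sorted(scores, key=lambda cust: scores[cust], reverse=True)
    n = len(ranked)
    labels: Dict[str, str] = {}
    for pos, cust in enumerate(ranked):
        band = "High" if pos < n / 3 else "Med" if pos < 2 * n / 3 else "Low"
        labels[cust] = f"{band} {suffix}"
    return labels

# Hypothetical outputs of the scoring and value-assignment steps
churn_score = {"a": 0.9, "b": 0.5, "c": 0.1, "d": 0.8, "e": 0.4, "f": 0.2}
customer_value = {"a": 900.0, "b": 100.0, "c": 500.0, "d": 50.0, "e": 950.0, "f": 450.0}

risk = tertile_labels(churn_score, "Risk")
value = tertile_labels(customer_value, "Val")

# Each customer lands in one of the nine Value x Risk cells
segments: Dict[str, Tuple[str, str]] = {c: (value[c], risk[c]) for c in churn_score}

for cust, cell in sorted(segments.items()):
    print(cust, cell)
```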
LTV and Its Application
• A customer’s life-time value (LTV) is the net
value that a customer brings in to a business by
the end of their service, i.e. their profit
contribution.
• LTV allows decisions for individual customers that
optimize the return-on-investment (ROI).
Examples:
– Aggressive retention programs, such as equipment
upgrade and contract renewal for high LTV.
– Differentiated customer care treatment for reactivations
by customers with low LTV
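A minimal sketch of how LTV can drive a per-customer ROI decision. It assumes a retention offer costs the same whether or not it works, and that it keeps a would-be churner with a fixed probability. The formula, parameter names, and numbers are illustrative assumptions, not figures from the case study.

```python
def expected_offer_gain(ltv: float, churn_prob: float, save_rate: float, offer_cost: float) -> float:
    """Expected net gain from making a retention offer to one customer.

    churn_prob * save_rate is the chance the offer actually prevents a loss;
    multiplying by LTV gives the expected value retained.
    """
    return churn_prob * save_rate * ltv - offer_cost

def should_offer(ltv: float, churn_prob: float, save_rate: float, offer_cost: float) -> bool:
    """Make the offer only when its expected net gain is positive."""
    return expected_offer_gain(ltv, churn_prob, save_rate, offer_cost) > 0

# Same churn risk and same offer, different LTV (hypothetical numbers)
high_ltv_gain = expected_offer_gain(ltv=1200.0, churn_prob=0.4, save_rate=0.25, offer_cost=60.0)
low_ltv_gain = expected_offer_gain(ltv=150.0, churn_prob=0.4, save_rate=0.25, offer_cost=60.0)

print(high_ltv_gain, low_ltv_gain)
```

With identical churn risk, the equipment-upgrade style offer pays off for the high-LTV customer and loses money on the low-LTV one, which is the differentiation the slide describes.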
What is Required?
• Detailed data
– Integration of CDR, WIG, SMS, Billing
– Maintained at detailed level
• Integrated data mining
– Algorithms tuned to model thousands of variables and millions of
rows
– Accurate Forecasts
• System Robustness
– Massively scalable back end system
– Flexible architecture to create new variables quickly and easily
• Collaborative Service Model
– Service model which guarantees success
– Combined IQ Model to optimize science and business knowledge
– Low cost to create and maintain models
Map Segments to Actions
[Figure: action map with Churn Probability (Low to High) on the vertical axis and Forecasted LTV (Negative, Low, High) on the horizontal axis. Regions and programs shown: Save Program; Let them go; Cost Reducing Programs; Change Plan; Bad Migration Behavior; Cautiously Defend; Equipment Upgrade; Feature Add; Grow Margin; Feature Use; Aggressively Defend; Contract Renewal; Elite Program; Nurture / Maintain; Loyalty Programs]
Cost Rules Applied…
Cost Rules are introduced to define scoring
For Example:
– Network System Usage Cost
– Mobile to Land Connections Costs
– Technical Operations/Support Costs
– Long Distance Costs
– Inter-Carrier / International subsidy costs
– Roaming Costs
– Bad Debt Allocation
– Many others…
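As an illustration of how such cost rules might feed a customer score, the sketch below treats each rule as a per-unit rate applied to usage and subtracts the total from revenue. The rates, usage fields, and the function `monthly_net_value` are hypothetical; the real rules in the case study are not given in the slides.

```python
from typing import Dict

# Hypothetical per-unit cost rules (currency units per unit of usage)
COST_RULES: Dict[str, float] = {
    "network_minutes": 0.02,
    "mobile_to_land_minutes": 0.03,
    "long_distance_minutes": 0.05,
    "roaming_minutes": 0.10,
    "support_calls": 4.00,
}

def monthly_net_value(revenue: float, usage: Dict[str, float], bad_debt_allocation: float = 0.0) -> float:
    """Revenue minus the cost implied by each rule, minus allocated bad debt."""
    cost = sum(rate * usage.get(item, 0.0) for item, rate in COST_RULES.items())
    return revenue - cost - bad_debt_allocation

usage = {"network_minutes": 500, "long_distance_minutes": 40, "roaming_minutes": 10, "support_calls": 1}
net = monthly_net_value(revenue=45.0, usage=usage, bad_debt_allocation=0.5)

print(net)
```

Keeping the rules in a table rather than in code paths makes it cheap to add or retune a rule, in the spirit of a "flexible architecture to create new variables quickly".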
Cost Rules for a Bank?
Cost Rules are introduced to define value
For Example:
– Deposit Value
– Product mix
– Average daily balance
– Monthly service fees
– Technical operations/Support costs
– Branch/teller usage
– Late payment/Overdraft history
– Interest rate
– Contract term
– Credit Score
– Employment history/Income
Pragmatic Grand Challenge 3
Integrating domain knowledge
• Data mining algorithms are knowledge free
• There is no notion of “common sense reasoning”
• Do we have to solve an AI-hard problem?
• Robust and deep domain knowledge utilization
• Solution:
– Very deep and very narrow integration
– Ability to “model” business strategy
– Reasoning capability just evolves (cf. chess players)
Cross-Sell / Up-Sell Example
Customer looking for pants
[Figure: recommendation flow. Labels shown: Help Me Decide; Complete the Assortment; Any Related Products; Recommendations; Collaborative Filtering; Context Sensitive Approach; Alternates; Up Sells; Complement; Add-on; Impulse Buy]
Pragmatic Grand Challenge 4
Managing and maintaining models
• When was the last time you thought about the lifetime of a
mining model?
• What happens when a model is changed?
• Have you tried to merge the results of two different clustering
models over time?
• How many “data droppings” (aka temp files, quick
transformations, quick fixes) do you generate in an analysis
session?
• A framework for managing, updating, and
retiring mining models
• Solution: use techniques that have been invented for
this: databases, systems management, software engineering, etc.
Pragmatic Grand Challenge 5
Effectiveness Measurement
• How do we measure [honestly] the effectiveness of a model in a
context?
• Return on Investment (ROI) measurement
• Evaluation in the context of the application
• A framework and methodology for measurement
and evaluation
– Build the measurement method as part of the design of the
model
– An engineering recipe for measurements, and a set of metrics
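One common way to build the measurement into the design is to hold out a control group that receives no treatment and compare outcomes. The sketch below assumes that setup. The function names, group sizes, and monetary figures are hypothetical, and the talk does not prescribe this particular recipe.

```python
def retention_lift(treated_retained: int, treated_total: int,
                   control_retained: int, control_total: int) -> float:
    """Difference in retention rate between treated customers and the holdout control."""
    return treated_retained / treated_total - control_retained / control_total

def campaign_roi(lift: float, treated_total: int,
                 value_per_retained: float, cost_per_contact: float) -> float:
    """(incremental value - campaign cost) / campaign cost."""
    incremental_value = lift * treated_total * value_per_retained
    cost = treated_total * cost_per_contact
    return (incremental_value - cost) / cost

# Hypothetical campaign: model-selected customers were contacted, a random holdout was not
lift = retention_lift(treated_retained=9000, treated_total=10000,
                      control_retained=850, control_total=1000)
roi = campaign_roi(lift, treated_total=10000, value_per_retained=300.0, cost_per_contact=5.0)

print(round(lift, 4), round(roi, 2))
```

Only the incremental retention over the control group is credited to the model, which is what keeps the measurement honest.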
Technical Challenges
Technical Challenges
0. Public benchmark data sets
• As a field we have failed to define a common data collection
• Very difficult to judge research and systems advances
• Not an easy task, but not impossible
• A mix of
– synthetic (but realistic) data sets
– and real datasets
Technical Challenges
1. How does the data grow?
• A theory for how large data sets get to be large
• Definitely not IID sampling from a static distribution
• Inappropriateness of a “single-population” model
2. Complexity/understandability tradeoff
• Explaining how, when and why a model works
• Explaining when a model fails
• A “Tuning Dial” for reducing the complex into the
understandable
Technical Challenges
3. Interestingness
• What is an “interesting” pattern or summary?
• How do you measure “novelty”?
• What is “unusual”? When is it worthy of attention?
• Is it low probability events? High summarization ability? Outliers?
Good fits? Bad fits?
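The slide leaves "interesting" open. As one conventional candidate among many, the lift of an association compares how often two items co-occur with how often they would co-occur if independent. The sketch below computes it on made-up baskets. The data and the function name `lift` are illustrative, and lift is not proposed in the talk as the answer.

```python
from typing import List, Set

def lift(transactions: List[Set[str]], a: str, b: str) -> float:
    """P(a and b) / (P(a) * P(b)); values above 1 mean a and b co-occur more than chance."""
    n = len(transactions)
    p_a = sum(a in t for t in transactions) / n
    p_b = sum(b in t for t in transactions) / n
    p_ab = sum(a in t and b in t for t in transactions) / n
    if p_a == 0 or p_b == 0:
        raise ValueError("lift is undefined when an item never occurs")
    return p_ab / (p_a * p_b)

# Hypothetical shopping baskets
baskets: List[Set[str]] = [
    {"pants", "belt"}, {"pants", "belt"}, {"pants"}, {"shirt"},
    {"shirt", "belt"}, {"socks"}, {"socks"}, {"shirt"},
]

pants_belt = lift(baskets, "pants", "belt")
pants_socks = lift(baskets, "pants", "socks")

print(round(pants_belt, 3), pants_socks)
```

A measure like this captures "unusual co-occurrence" but says nothing about novelty to a particular user, which is part of why the slide poses interestingness as an open problem.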
Technical Challenges
4. Scalability
Beyond just dealing with a large data set:
• Principled feature reduction: what is the SVD equivalent? Graceful
degradation with dimensionality
• Uncovering graphical structure in data
– Communities, relations, link analysis, …
• Dealing with multiple data types:
– Structured, sparse, dense, text, images, video, audio, sequence
data, etc.
– I have yet to see an algorithm that deals with more than one type.
• Integration with DBMS
– Appropriate sampling
– Appropriate operator abstractions
• Taking care of “minor details”
– Initialization?
– Determining k
Technical Challenges
5. A theory for what we do
• What are the fundamental abstractions?
• What are the basic operations? What are the basic
components of an algorithm?
• What is it that we are optimizing?
• What is hard? What is doable? Why?
• What is a “data summary”?
• When are two attributes “similar”? Can you measure
efficiently?
• How do we extract the right representation?
A new theory is needed
• What are the fundamental problems?
• What do partial models or summaries of data really
mean?
• What are the implications of post hoc data analysis?
When is it/is it not reasonable to conclude a task is
appropriate?
• A new algebra for dealing with highly-summarized
views of the world
• Effect of sparse spaces on dimensionality. What is the
true dimensionality of data? What are the limits?
• A theory for adaptive sampling
Summary
Pragmatic and Technical Grand Challenges
Challenges
0. Public and challenging benchmark data sets
Pragmatic                    Technical
1. Where’s the Data?         1. Understanding “large”
2. In Situ mining            2. Simplicity knob
3. Domain knowledge          3. Interestingness
4. Life-cycle maintenance    4. Scalability
5. Metrics                   5. Theory of what we do
A Scorecard for the field: At least 2 advances in the
next 10 years!!!
Data Mining Grand Vision
ACME CORP ULTIMATE DATA MINING BROWSER
What’s New?
What’s Interesting?
Predict for me
In the meantime, there is an
understanding gap
• The technical community speaks of tech
problems
• Business strategic thinking hits an
“understandability wall”
• Traditionally, the thinking of business
strategy never included data
• A new generation of business challenges
is born
Data Strategy
• Is the mapping of the capabilities enabled by
data in driving the business
• The Integration of data-driven capabilities in
revenue-driving activities
• The Integration of data-derived metrics to
feed back into the measurement of the success
of the business
• Evolving to an operational state where planning
includes data, measurability, and data-driven
feedback loops
A Data Miner’s Story
• Started out in pure research
• At NASA-JPL
• At Microsoft Research
• At digiMine/Revenue Science Inc.
• At DMX Group
• So why join Yahoo!?
Yahoo! Case Study
Evolving the Data Strategy as Chief Data Officer
Yahoo! is the #1 Destination on the Web
• 73% of the U.S. Internet population uses Yahoo!
– About 500 million users per month globally!
• Global network of content, commerce, media, search and access products
• 100+ properties including mail, TV, news, shopping, finance, autos, travel, games, movies, health, etc.
• 25 terabytes of data collected each day… and growing
• Representing thousands of cataloged consumer behaviors
More people visited Yahoo! in the past month than:
• Use coupons
• Vote
• Recycle
• Exercise regularly
• Have children living at home
• Wear sunscreen regularly
Data is used to develop content, consumer, category and campaign insights for our key content partners and large advertisers
Sources: Mediamark Research, Spring 2004 and comScore Media Metrix, February 2005.
Yahoo! Data – A league of its own…
[Figure: bar charts of “Terabytes of Warehoused Data” and “Millions of Events Processed Per Day”, comparing Yahoo! systems (Y! Data Highway, Y! Main warehouse, Y! Panama, Y! Panama Warehouse, Y! LiveStor) with Walmart, NYSE, VISA, SABRE, AT&T, Korea Telecom and Amazon. Numeric labels in the figure: 14,000; 100; 94; 49; 500; 25; 225; 1,000; 120; 2,000; 50; 5,000]
GRAND CHALLENGE PROBLEMS OF DATA PROCESSING
TRAVEL, CREDIT CARD PROCESSING, STOCK EXCHANGE, RETAIL, INTERNET
Y! PROBLEM EXCEEDS OTHERS BY 2 ORDERS OF MAGNITUDE
To be continued…
• Will cover the Yahoo! case study on Tuesday’s
Invited talk
• Will include
– Strategic Importance of Data
– Evolving the data strategy
– Evolving towards the need to invent the new sciences
of the Internet
Hope the Data Miner’s Story continues…
Perhaps to a happy ending?
Thank You! & Questions?
[email protected]