Data Mining : What we know?

Data Mining : Commercial
Applications
Min-Te Chao (趙民德)
Academia Sinica
Institute of Statistical Science
2002/10/28
1
• DM = good data analysis
• KDD = DM with a commercial objective in
mind
2
• Data mining for maximum value is difficult
unless a structured plan is followed.
• Follow the Knowledge Discovery process to
get the most out of data mining.
3
• Knowledge Discovery and Data Mining:
The Expectation of Magic
(Dorian Pyle, PC AI magazine, Sept/Oct
1998)
4
• Business managers seem to expect magic
from applying data mining tools to their
data.
5
The key to appropriate use of data mining
lies in a structured methodology to
• Find problems,
• Define solutions,
• Set expectations, and
• Deliver results
6
This process is called Knowledge Discovery.
7
10 guiding principles
• Select clearly defined problems that will
yield tangible benefits
• Specify the solution required
• Specify how the solution delivered will be
used
8
• Understand as much as possible about the
problem and the data set (the domain)
• Let the problem drive the modeling (i.e.,
tool selection, data preparation, etc.)
• Stipulate assumptions
• Iteratively refine the model
• Make the model as simple as possible, but
no simpler
9
• Define instability in the model: areas
where the change in output is drastically
large for a small change in input
• Define uncertainty in the model: critical
areas and ranges in the data set where
the model produces low-confidence
predictions or insights.
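The instability principle can be made concrete by probing a model with small input changes and flagging the places where the output jumps. The sketch below is only an illustration: `toy_model`, the step size, and the threshold are hypothetical choices, not taken from the talk.

```python
def find_unstable_points(model, inputs, step=0.01, threshold=10.0):
    """Return inputs where a small change in input causes a large change in output.

    `model` is any function of one number; `step` and `threshold` are
    illustrative values, not values from the source.
    """
    unstable = []
    for x in inputs:
        # Finite-difference sensitivity: output change per unit of input change.
        sensitivity = abs(model(x + step) - model(x)) / step
        if sensitivity > threshold:
            unstable.append(x)
    return unstable


def toy_model(x):
    # Hypothetical model with a sharp jump at x = 1.0.
    return 0.0 if x < 1.0 else 100.0


grid = [0.0, 0.5, 0.995, 1.5, 2.0]
unstable_points = find_unstable_points(toy_model, grid)
print(unstable_points)
```

Only the grid point just below the jump is flagged; everywhere else the toy model is flat. A real project would probe each input variable of the fitted model in the same way.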
10
Mining the Data (three parts)
• Preparing the data
• Surveying the data
• Modeling the data.
11
• Briefly, problem exploration involves the
discovery of appropriate problems using
interviewing and problem elicitation techniques.
• Decision support tools, including pair-wise
rankings and ambiguity resolution, help build a
problem matrix.
• The problems are ranked for the benefit each
will return based on various factors of
importance to the problem owner.
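The ranking step can be sketched as a simple pair-wise tally. The problem names and benefit scores below are hypothetical stand-ins for the problem owner's judgments; the source does not say which decision support tools or scoring rules were used.

```python
from collections import Counter
from itertools import combinations


def rank_by_pairwise_wins(items, prefer):
    """Rank items by how many pair-wise comparisons each one wins.

    `prefer(a, b)` returns whichever of the two items the problem owner
    judges more beneficial; it stands in for an interview answer.
    """
    wins = Counter({item: 0 for item in items})
    for a, b in combinations(items, 2):
        wins[prefer(a, b)] += 1
    # Sort by wins, highest first; ties keep their original order.
    return sorted(items, key=lambda item: -wins[item])


# Hypothetical benefit scores standing in for the owner's judgments.
benefit = {"reduce churn": 3, "cut inventory cost": 2, "speed up testing": 1}
problems = ["speed up testing", "reduce churn", "cut inventory cost"]
ranking = rank_by_pairwise_wins(
    problems, lambda a, b: a if benefit[a] >= benefit[b] else b
)
print(ranking)
```

In practice each comparison would come from an interview rather than a score table, which is why the comparison is passed in as a function.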
12
Solution exploration finds the most effective
solution for each problem, ranking
alternatives if necessary.
13
Implementation addresses such issues as:
what is to be delivered,
who will use the solution,
how it will be used,
what training is required to use it,
how long it will remain effective,
how to monitor continued effectiveness.
14
• Data preparation takes at least 60% of the
project’s time.
• Implementation specification is key to the
project’s success
15
• Projects that were very successful
technically can fail because the results
were never implemented in practice.
• Without the will, resources, and
commitment to put the solution in place,
Knowledge Discovery will yield no return
at all!
16
• “Let me give you my data; tell me what you
find.” Familiar words?
• This is the expectation of magic.
17
The outcome of a data mining project
consists of a model which does one of two
things: The model will be
• Explanatory, or
• Predictive.
18
Explanatory (inferential) models explain the
relationships that exist in data. They may
• indicate the driving factors for stock market
movements, or
• show failure factors in printed circuit board
production.
Regardless of purpose, these models help
explain relationships.
19
• Predictive models may or may not explain
relationships. Primarily, they make
predictions of output conditions given a set
of input conditions
20
• In many direct mail solicitation campaigns,
the marketing manager did not ask what
factor motivated people to respond to the
solicitation.
• Instead, the focus of the model was simply
to increase response.
• If it worked reliably and robustly, fine. If not,
it was of no value.
21
• Whether explanatory or predictive, the
data mining model must provide
actionable information. This is critical to
the project’s success.
22
• The purpose of the project is to provide
information that will allow better decision-making.
• Therefore, data mining is a tool in the
decision support arsenal, a formidably
potent tool when properly used.
23
• Knowledge Discovery, as a process,
makes sure the goals of mining data align
with the user’s needs. The results will
directly and unambiguously bear on the
domain of the decision to be made.
24
• Knowledge Discovery aligns the objectives
of the modeler with the problem domain to
search for optimal return for the effort
invested.
25
• Instead of “let me give you my data”,
Knowledge Discovery leads to “let’s
discuss the problem and see what can be
done”. No magic here. This is a structured
search of alternatives and options.
26
• Each stage requires a commitment from
separate groups of people inside a
business or organization. At each stage,
various parties work through the issues,
making choices at each point, and fully
understand the issues and expectations.
27
Example 1
• A Fortune 500 pharmaceutical and biochemical company heard of data mining
and wanted to explore what it could do for
them.
28
• Some of the managers read of the
wonderful things that data mining could do
by just looking at their data.
• Copious amounts of data were not simply
accepted; instead, the benefits of
Knowledge Discovery were explained.
29
• The initial exploration, which included
Problem and Solution Exploration, took
two weeks.
• When completed, more than 250 problems
were clearly defined for areas including
personnel, manufacturing, inventory
control, and testing.
30
• Managers in each department worked
through defining appropriate problems and
defining solutions.
• Their involvement was crucial to finding
appropriate problems, the solutions to
which would yield real business value.
31
• Senior managers and bio-chemists were
presented with the results, an analysis that
defined where the resources were located
and which projects were to proceed. This
led to the Implementation phase.
32
• Note the level of involvement of the key
actors. Because they worked through the
problem, understood realistically what
might be done with each problem, and
evaluated the issue of implementing the
solution, this project was a success.
33
Example 2
• A major telecommunications company’s
marketing department wanted data mining
to solve their churn problem.
34
• When presented with the Knowledge
Discovery approach, they dismissed it as
irrelevant in this case. Their problem was
churn: well defined, well understood.
• Build a model to predict churn, and all
would be well.
35
• The data was dirty and polluted, but with
the help of advanced data preparation
techniques, a reliable and robust model
was constructed which was 83% accurate
at predicting which customers would churn.
36
• The best previous techniques had
achieved about 59% accuracy. The model
provided a 40% improvement in predictive
power.
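The 40% figure is a relative improvement, not a difference in percentage points. A minimal check of the arithmetic, using only the two accuracies quoted on the slides (the function name is illustrative):

```python
def relative_improvement(old, new):
    """Relative gain of `new` over `old`, as a fraction of `old`."""
    return (new - old) / old


gain = relative_improvement(0.59, 0.83)
print(f"{gain:.1%}")  # 40.7%
```

The absolute gain is 24 percentage points (83% minus 59%); dividing by the 59% baseline gives roughly 40%.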
37
• Marketing then spent a six-figure sum
attempting to avert churn -- to no avail!
38
• Predicting churn was not the problem. The
problem with churn, perhaps, would have
been better addressed by building a
demographic or sociographic model of the
causes of churn, and then addressing those
causes.
• That, however, did not occur.
39
• They were persuaded to try again using the
Knowledge Discovery process. It turned
out that for this company the most
valuable feature was “Customer Lifetime
Value”. Identifying and focusing on the
factors that drive this feature
yielded significant benefit.
40
• Solving the right problem is more
important than simply building a good
model.
• The Knowledge Discovery process does
exactly that.
41
Three Components of DM
• Data Preparation
• Data Surveying
• The Data Model
42
• Data Preparation is the most important part of
mining.
• Sometimes the data is available in a data
warehouse. This is helpful, but not sufficient.
• Data preparation for data mining is a different
activity than preparing data for warehousing.
• CRISP
43
• Data mining requires fixing the problems
of missing and empty variables, monotonic
variables, categorical ordering, and many
other problems not dealt with in data
warehousing.
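A few of these checks can be sketched in plain Python. This is an assumption-laden illustration, not the preparation method used in the talk: the column names are made up, `None` and the empty string are treated as missing, and only empty, constant, and strictly increasing (monotonic) columns are flagged.

```python
def survey_column(values):
    """Flag common mining problems in one column (a list of raw values).

    None and "" are treated as missing; this convention is an assumption.
    """
    present = [v for v in values if v is not None and v != ""]
    issues = []
    if not present:
        issues.append("empty")
        return issues
    if len(present) < len(values):
        issues.append("missing values")
    if len(set(present)) == 1:
        issues.append("constant")
    elif all(isinstance(v, (int, float)) for v in present) and all(
        a < b for a, b in zip(present, present[1:])
    ):
        # Strictly increasing values (IDs, timestamps) rarely generalize.
        issues.append("monotonic")
    return issues


# Hypothetical table: column name -> raw values.
table = {
    "customer_id": [101, 102, 103, 104],
    "fax_number": [None, None, "", None],
    "region": ["N", "", "S", "N"],
}
report = {name: survey_column(col) for name, col in table.items()}
print(report)
```

A warehouse load would typically accept all three columns as they are; for mining, the first two carry no usable signal and the third needs a decision about how to treat the gap.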
44
• In one extreme example, data from a
warehouse not prepared for mining was
modeled and produced a model that was
6% effective at predicting the required
feature. This data had many problems, but
after suitable preparation a reliable and
robust model that was nearly 60%
effective was produced.
45
• Data Surveying involves a look at the
shape of the whole data set, by building a
map of the territory before expending the
time and effort required to create models.
The survey addresses the question
“Is the answer in here anyway?”
46
• The Data Model is the small-scale map of
some very particular part of the territory.
The nature of the data and the purpose of
the model will determine which tools are
appropriate.
• Building the model is the piece that is
typically thought of as data mining --- the
application of automated tools to data.
47
• While important, building the model is just
a piece of the whole Knowledge Discovery
process.
48
• Data mining, the practice of applying automated
pattern detection software tools to data, is not
carried out in isolation from the rest of the world.
• A commercial data mining project will not be
successful if it is not driven by business needs.
Discovering appropriate business problems,
defining solutions to those problems, using
appropriate data, and building useful models
requires an integrated process.
49
• The Knowledge Discovery process
provides the necessary framework to
ensure a successful outcome, if one is
possible.
50
• It is a structured, multi-step process. After
completing each stage, results are evaluated to
determine the most fruitful next step. This
iterative procedure requires the commitment and
involvement of many people. This ensures
everyone involved understands the process, and
that they carefully evaluate the cost and
potential benefits.
51
• Commitment to proceed requires
understanding the value of expected
results.
• At all stages appropriate expectations are
set, and the process is viewed as part of
decision-making and policy guidance.
52
• The committed involvement and
understanding of managers seeking
measurable results removes the
expectation of magic from data mining.
53