Knowledge discovery process

Chapter 1
Knowledge discovery process
Juha Vesanto
Starting point!
Data exploration starts with data?
The real starting point!
Data exploration starts with data?
Data exploration starts with identifying a need!
Customer
Participation
• Problem owners
• Problem holders
Motivation
• Useful
• Profitable
The process (CRISP-DM)
The process (Pyle)
20% work, 80% importance:
• Exploring the problem
• Exploring the solution
• Implementation specification
80% work, 20% importance:
• Preparation
• Survey
• Data modeling
The problem
• Identify the right problem
• Define solvable problem(s)
• Transfer the understanding of the problem to the miner
Example
“I really need a model of the Monday and
Friday failure rates so we can stop them!”
• What is a failure?
• How is it detected/measured?
• Is it a quality problem or just fluctuation of error rates?
• Which problem components need to be looked at?
• ...
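The question of whether the Monday and Friday failures are a quality problem or just fluctuation can be explored with a rough check like the one below. This is only a sketch under stated assumptions: the record format of (weekday, failed) pairs, the function name `failure_rates_by_weekday`, and the use of a binomial standard error as the yardstick are all illustrative choices, not something the slides prescribe.

```python
from collections import defaultdict
from math import sqrt


def failure_rates_by_weekday(records):
    """Summarize failure rates per weekday.

    records: non-empty iterable of (weekday, failed) pairs (hypothetical format).
    Returns (overall_rate, {weekday: (rate, z_score)}).
    """
    totals = defaultdict(int)
    failures = defaultdict(int)
    for weekday, failed in records:
        totals[weekday] += 1
        failures[weekday] += int(failed)

    overall = sum(failures.values()) / sum(totals.values())
    report = {}
    for day, n in totals.items():
        rate = failures[day] / n
        # Standard error of a proportion if this day shared the overall rate.
        stderr = sqrt(overall * (1 - overall) / n)
        # How many standard errors the day's rate lies from the overall rate.
        z = (rate - overall) / stderr if stderr > 0 else 0.0
        report[day] = (rate, z)
    return overall, report
```

A day whose z-score stays small is consistent with ordinary fluctuation; a large value suggests the day deserves a closer look. What counts as "large" is a judgment for the problem owner and the miner to agree on.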
The solution
What does the solution look like?
- a program used by an expert
- a data set to be referred to
- a model to be used for prediction
- a presentation / report
- ...
How (and by whom) is the solution implemented?
Data mining
• Prepare: both the data and the miner
• Survey: understand the data; is the data adequate?
• Model: refine the details; depends on the nature of the data and the solution goal
Why preparation?
GIGO (garbage in, garbage out): fix the data
Get a data set which
• is of maximum use
• preserves the information
• is enhanced for the problem & model
PIE
Prepared Information Environment
1. prepare the training/testing data
2. transform prepared values back to original values
3. apply the same preparation to new data
new data -> PIE-in -> data -> model -> PIE-out -> report
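The three PIE tasks can be sketched in code as follows. This is a minimal illustration, assuming the preparation is simple standardization of one numeric input and one numeric output; the class `StandardizingPIE`, its methods, and the helper `predict` are hypothetical names, and a real preparation would usually involve far more than scaling.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class StandardizingPIE:
    """Hypothetical PIE: standardizes one numeric input and one numeric output."""
    in_mean: float = 0.0
    in_std: float = 1.0
    out_mean: float = 0.0
    out_std: float = 1.0

    def fit(self, inputs, outputs):
        # Task 1: learn the preparation from the training data only.
        # A zero spread falls back to 1.0 to avoid dividing by zero.
        self.in_mean, self.in_std = mean(inputs), pstdev(inputs) or 1.0
        self.out_mean, self.out_std = mean(outputs), pstdev(outputs) or 1.0
        return self

    def prepare_target(self, value):
        # Task 1 (continued): prepare an output value for training/testing.
        return (value - self.out_mean) / self.out_std

    def pie_in(self, value):
        # Task 3: apply the same, already fitted preparation to new data.
        return (value - self.in_mean) / self.in_std

    def pie_out(self, prepared):
        # Task 2: transform a prepared value back to original units.
        return prepared * self.out_std + self.out_mean


def predict(pie, model, new_value):
    """new data -> PIE-in -> model -> PIE-out."""
    return pie.pie_out(model(pie.pie_in(new_value)))
```

The point of keeping `fit` separate from `pie_in` is that new data is never used to re-estimate the preparation; it only passes through the transformation learned from the training data.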
Why survey?
Get a broad idea of the data:
• what is covered
• what is not covered, or is covered poorly
Dangerous areas:
• bias in data
• sparse data (in a dynamic area)
Is the data adequate?
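One simple way to see what is covered and what is covered poorly is to count samples per region of a variable's range. The sketch below assumes a single numeric variable and equal-width bins; the function name `survey_coverage` and the `min_count` threshold are illustrative assumptions, not a rule from the slides.

```python
def survey_coverage(values, low, high, n_bins=10, min_count=5):
    """Count samples per equal-width bin over [low, high] and flag sparse bins.

    Returns (counts, sparse), where sparse lists the (start, end) ranges
    of bins holding fewer than min_count samples.
    """
    width = (high - low) / n_bins
    counts = [0] * n_bins
    for v in values:
        if low <= v <= high:
            # Clamp so that v == high falls into the last bin.
            idx = min(int((v - low) / width), n_bins - 1)
            counts[idx] += 1
    sparse = [(low + i * width, low + (i + 1) * width)
              for i, c in enumerate(counts) if c < min_count]
    return counts, sparse
```

Sparse or empty bins mark regions where a model would have little evidence to work with; whether that matters depends on whether the solution has to operate in those regions.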
Modeling hype
Universal approximator
-> can be applied to any data
Data-driven
-> no theoretical knowledge required
Modeling definition
Model: “a representation … to show the construction
or serve as a copy of something”
= makes information understandable or usable =
Modeling in data mining
Modeling is iterative:
1. Define problem
2. Select tool
3. Collect data
4. Make model
5. Apply
6. Evaluate
Traditional statistical methods: first model, then data
Model types
• Active or passive
• Explanatory or predictive
• Static or continuously learning
Ten golden rules
1. Select a clear problem with tangible benefit
2. Specify the required solution
3. Define how the solution is implemented
4. Understand the domain
5. Let the problem drive the modeling
6. Stipulate assumptions
7. Refine the model iteratively
8. Make the model as simple as possible (but no simpler)
9. Find areas of instability
10. Find areas of uncertainty
Critique
• Model evaluation is missing
• Iteration of planning stage
• Domain expert as data miner