Chapter 3
Data Mining Methodology and
Best Practices
Data Mining’s Virtuous Cycle
1. Identifying the business opportunity*
2. Mining data to transform it into
actionable information
3. Acting on the information
4. Measuring the results
* The textbook uses “problem” and “opportunity” interchangeably
It’s time to…
• Turn our attention to translating business
opportunities (problems) into data mining
opportunities (problems) including:
– Transforming data into information via:
• Hypothesis testing
• Profiling
• Predictive modeling
– Taking action
• Model deployment
• Scoring
– Measurement
• Assessing a model’s stability & effectiveness before it is used
DM General Guidelines
• The DM virtuous cycle (4 steps) is iterative
• No steps should be skipped
• Common sense prevails with respect to
how rigorously each step is carried out
• Simplest approach: ad-hoc queries to test
hypotheses
• Rigorous approach: The 4 steps of the
virtuous cycle expand to become an 11-step methodology
Why have a Methodology?
• A DM methodology which includes DM
Best Practices helps to avoid:
– Learning things that are not true
– Learning things that are true, but not useful
• Learning things that are not true is more
dangerous than learning things that are true but not useful.
Why is that? …
Learning Things that are not True
• Patterns may not represent any underlying
rule
• The sample may not reflect its parent
population, which introduces bias
• Data may be at the wrong level of detail
(granularity; aggregation)
Examples?
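One possible example of data at the wrong level of detail: the sketch below uses hypothetical response counts for two campaigns, X and Y (the campaign names, segments, and numbers are all invented for illustration). Campaign X responds better within each customer segment, yet the aggregated totals suggest Y is better, because X was mailed mostly to the harder segment.

```python
# Hypothetical response counts: (responders, mailed) per campaign and segment.
results = {
    "X": {"new": (30, 300), "loyal": (60, 100)},
    "Y": {"new": (8, 100), "loyal": (165, 300)},
}

def rate(responders, mailed):
    """Response rate for one cell of the table."""
    return responders / mailed

def overall_rate(segments):
    """Response rate after aggregating all segments of one campaign."""
    responders = sum(r for r, _ in segments.values())
    mailed = sum(m for _, m in segments.values())
    return rate(responders, mailed)

for campaign, segments in results.items():
    by_segment = {name: round(rate(*counts), 3) for name, counts in segments.items()}
    print(campaign, by_segment, "overall:", round(overall_rate(segments), 3))
```

Looking only at the aggregated rate would lead to a conclusion that the segment-level data contradicts.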
Learning Things that are True, but not Useful
• Learning things that are already known
Examples?
• Learning things that cannot be used
Examples?
Hypothesis Testing
• A hypothesis is a proposed explanation whose validity
can be tested by analyzing data
• Purpose is to validate or invalidate preconceived ideas
• Usually included in all DM projects
• Data collection done via:
– Observation
– Experiment (lab, survey)
• Bias must be avoided; doing so usually requires both
analytical and business knowledge
• Hypothesis testing is useful, but often insufficient, which
leads us to…
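Before moving on, here is a minimal sketch of testing one hypothesis against experimental data. It assumes a hypothetical mailing experiment (the response counts, the function name `permutation_test`, and the choice of a permutation test are illustrative assumptions, not the textbook's method) and asks whether the offer group's higher response rate could plausibly be chance.

```python
import random

def permutation_test(group_a, group_b, n_permutations=5_000, seed=0):
    """Two-sided permutation test for a difference in means.

    Returns the fraction of random relabelings whose absolute difference
    in means is at least as large as the observed one (an estimated p-value).
    """
    rng = random.Random(seed)
    n_a, n_b = len(group_a), len(group_b)
    observed = abs(sum(group_a) / n_a - sum(group_b) / n_b)
    pooled = list(group_a) + list(group_b)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # randomly reassign records to the two groups
        diff = abs(sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / n_b)
        if diff >= observed:
            extreme += 1
    return extreme / n_permutations

# Hypothetical responses (1 = responded) from a mailing experiment.
offer = [1] * 30 + [0] * 70      # 30% response with the offer
control = [1] * 15 + [0] * 85    # 15% response without it
p_value = permutation_test(offer, control)
print("estimated p-value:", p_value)
```

A small estimated p-value suggests the difference is unlikely to be a chance pattern; it says nothing about whether the experiment itself was free of bias.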
Models
• Model: An explanation or description of how something works that
reflects reality well enough that it can be used to make inferences
about the real world.
• We use models every day…Examples?
• The data used to build DM models is called the Model Set
• The new data to which a model is applied is called the Score Set
• Model Set includes:
– Training Set – used to build a set of DM models
– Validation Set – used to choose best DM model
– Test Set – used to determine how the model performs
• Models – 3 kinds of DM models for 3 kinds of tasks…next slide
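As a minimal sketch of how a model set might be partitioned, the snippet below shuffles hypothetical record IDs and splits them into training, validation, and test sets. The 60/20/20 proportions, the function name `split_model_set`, and the fixed seed are assumptions for illustration, not values from the chapter.

```python
import random

def split_model_set(records, train_frac=0.6, valid_frac=0.2, seed=42):
    """Partition a model set into training, validation, and test sets.

    The fractions are illustrative assumptions; whatever remains after the
    training and validation slices becomes the test set.
    """
    if train_frac <= 0 or valid_frac <= 0 or train_frac + valid_frac >= 1:
        raise ValueError("fractions must be positive and sum to less than 1")
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # shuffle a copy, reproducibly
    n_train = int(len(shuffled) * train_frac)
    n_valid = int(len(shuffled) * valid_frac)
    training = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]
    return training, validation, test

# Hypothetical customer IDs standing in for model-set records.
customer_ids = range(1000)
training, validation, test = split_model_set(customer_ids)
print(len(training), len(validation), len(test))
```

Keeping the three sets disjoint is the point: the validation set picks among candidate models, and the test set is touched only once to estimate how the chosen model performs.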
Profiling and Prediction
• Profiling
– describes what is in the data
– Demographic variables
– Cannot distinguish cause from effect (e.g., beer drinkers and
males)
– Focus is on the past to explain it (timing = past)
• Prediction
– Finding patterns in data from prior period(s) that are capable of
explaining or anticipating outcomes in a later period (timing =
future)
– Predictive models require separation in time between the model
inputs and output.
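The separation in time can be sketched as follows: inputs are computed only from records before a cutoff date, and the target only from a later window. The transaction log, the field names, and the dates are hypothetical and chosen purely for illustration.

```python
from datetime import date

# Hypothetical transaction log: (customer_id, purchase_date, amount).
transactions = [
    ("A", date(2023, 1, 15), 40.0),
    ("A", date(2023, 3, 2), 25.0),
    ("A", date(2023, 5, 20), 30.0),
    ("B", date(2023, 2, 10), 60.0),
    ("C", date(2023, 3, 28), 15.0),
    ("C", date(2023, 4, 5), 10.0),
]

def build_prediction_rows(transactions, cutoff, horizon_end):
    """Return one row per customer seen before `cutoff`.

    Inputs use only purchases before `cutoff`; the target flags a purchase
    in the later window [cutoff, horizon_end).
    """
    rows = {}
    for customer, when, amount in transactions:
        if when < cutoff:
            row = rows.setdefault(
                customer,
                {"past_spend": 0.0, "past_orders": 0, "bought_later": False},
            )
            row["past_spend"] += amount
            row["past_orders"] += 1
    for customer, when, _ in transactions:
        # Customers with no history before the cutoff are left out.
        if cutoff <= when < horizon_end and customer in rows:
            rows[customer]["bought_later"] = True
    return rows

rows = build_prediction_rows(
    transactions, cutoff=date(2023, 4, 1), horizon_end=date(2023, 7, 1)
)
for customer, row in sorted(rows.items()):
    print(customer, row)
```

If any input were computed from the target window, the model would appear to predict well on the model set but fail when scored on genuinely new data.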
Data Mining Methodology
1. Translate biz opportunity (problem) into DM opportunity (problem)
2. Select appropriate data
3. Get to know the data
4. Create a model set
5. Fix problems with the data
6. Transform data to bring information to the surface
7. Build models
8. Assess models
9. Deploy models
10. Assess results
11. Begin again
In-Class Exercise
• 10 Teams
• Each team takes one of methodology
steps 1-10 (step 11 is skipped)
• Discuss it and prepare a 5-minute (or shorter)
summary for your colleagues
• Have each team present its summary
Discussion: 15 minutes
Present: 45 minutes
End of Chapter 3