The Process of Data Mining
Data Mining Methodology
1
Why have a Methodology
Don’t want to learn things that aren’t true
May not represent any underlying reality
○ Spurious correlation
○ May not be statistically significant or may be
statistically significant but coincidental
○ Because data mining makes fewer assumptions
about the data and searches through a richer
hypothesis space, this is a big issue
Model overfitting is an issue
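The risk of searching a rich hypothesis space can be illustrated with a small simulation. This is only a sketch under assumed settings that are not from the slides: 20 examples, 1000 candidate variables, and every value drawn as independent noise. No variable has any real relationship to the target, yet the best of the 1000 will typically look strongly correlated with it.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

rng = random.Random(0)            # fixed seed so the run is repeatable
n_examples, n_features = 20, 1000  # illustrative sizes (assumption)

# The target and every candidate feature are independent random noise.
target = [rng.gauss(0, 1) for _ in range(n_examples)]
features = [[rng.gauss(0, 1) for _ in range(n_examples)]
            for _ in range(n_features)]

correlations = [abs(pearson(f, target)) for f in features]
best = max(correlations)
typical = sorted(correlations)[n_features // 2]   # median |r|

print(f"median |r| = {typical:.2f}, best |r| = {best:.2f}")
```

The gap between the median and the best correlation is produced purely by the search, which is why a pattern found this way has to be checked on data that played no part in finding it.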
2
Why have a Methodology II
Data may not reflect relevant population
Data mining normally assumes training data
matches the test and score data
Quick overview of how data is used for DM
○ Training set used to build the model
○ Validation set used to tune model or select amongst
alternative models
○ Test set used to evaluate model & report quality
For prediction tasks, test set must have the “answer”
○ Model eventually applied to score set, which for
predictive tasks, does not have the answer
○ Evaluation must always occur on data not used to
build or tune or select the model
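The three-way division of labeled data described above can be sketched as follows. The 60/20/20 proportions, the function name `split_labeled_data`, and the toy data are assumptions chosen for illustration; the slides do not prescribe any particular split.

```python
import random

def split_labeled_data(examples, train_frac=0.6, valid_frac=0.2, seed=0):
    """Shuffle labeled examples and split them into train/validation/test."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_frac)
    n_valid = round(len(shuffled) * valid_frac)
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]   # whatever remains is held out
    return train, valid, test

# Hypothetical labeled data: (feature, label) pairs.
labeled = [(x, x % 2) for x in range(100)]
train, valid, test = split_labeled_data(labeled)
print(len(train), len(valid), len(test))
```

Because the three lists are disjoint slices of one shuffled list, no example used to build or tune the model can leak into the test set.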
3
Why have a Methodology III
Do not want to learn things that are not
useful
May be already known
May not be actionable
4
Hypothesis Testing
Data Mining is not usually used for
hypothesis testing
B&L does not really say this
Typical assumption is data already collected
and you have little influence on the process
○ Data may be in a data warehouse
○ Usually you do not modify the scenarios for
collecting the data or the parameters
○ Experimental design not part of data mining
○ Active learning is related to this, where you
carefully select the data to learn from
5
The Methodology (Fayyad)
According to the article by Fayyad et al.,
the main steps are:
Data Selection
Preprocessing
Transformation
Data Mining
Interpretation/Evaluation
6
The Methodology (B & L)
According to Berry & Linoff, the main steps are:
Translate business problem into a DM problem
Select Data
Get to know the data
Create a model set
Fix problems with the data (“preprocess”)
Transform the data
Build models (“Data mining”)
Assess models (“Interpret/Evaluate”)
Deploy Models
Assess Results then start over
7
Steps in the Process: Selection
Many of the steps are not very complex,
so here are some selective comments:
Selection:
○ DM usually tries to use all available data
○ May not be necessary; can generate learning
curves that show how performance varies
with increasing amounts of data
○ Data Mining is not afraid of using lots of
variables (unlike statistics). But some data
mining methods (especially statistical ones)
do have problems with many variables.
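A learning curve of the kind mentioned above can be produced by training on progressively larger prefixes of the data and scoring each model on the same held-out set. The sketch below makes several assumptions that are not in the slides: the data is synthetic (two overlapping one-dimensional Gaussian classes), the learner is a simple nearest-centroid threshold rule, and the training sizes are arbitrary.

```python
import random

def make_examples(n, rng):
    """Synthetic data: class 0 ~ N(0, 1), class 1 ~ N(2, 1), labels alternate."""
    return [(rng.gauss(2.0 * (i % 2), 1.0), i % 2) for i in range(n)]

def class_mean(train, label):
    xs = [x for x, y in train if y == label]
    return sum(xs) / len(xs)

def fit(train):
    """Nearest-centroid rule: threshold halfway between the two class means."""
    mean0, mean1 = class_mean(train, 0), class_mean(train, 1)
    return (mean0 + mean1) / 2, mean1 > mean0

def accuracy(model, data):
    threshold, one_is_high = model
    correct = 0
    for x, y in data:
        predicted = int((x > threshold) == one_is_high)
        correct += predicted == y
    return correct / len(data)

rng = random.Random(0)
pool = make_examples(1024, rng)       # labeled data available for training
test_set = make_examples(2000, rng)   # held out, never used for fitting

sizes = [4, 16, 64, 256, 1024]
curve = [(n, accuracy(fit(pool[:n]), test_set)) for n in sizes]
for n, acc in curve:
    print(f"{n:5d} training examples -> test accuracy {acc:.3f}")
```

If the curve has flattened well before all the data is used, the remaining data adds little for that learner, which is the point of the bullet above.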
8
Steps in the Process: Know the Data
Getting to know the data:
always useful and also helps make sure you
understand the problem
Data visualization can help
Data mining is not really like a black box
where the computer does all of the work
○ having or generating good features (variables)
is critical. Data visualization can help
9
Steps in the Process: Create Model Set
Creating a model (training) set
Sometimes you may want to form the training
set other than by random sampling
○ It is often recommended to balance the classes if
they are highly unbalanced
Not really a good idea, nor is it needed. Can use cost-sensitive
learning instead, which we will address later
May want to focus on harder problems
- Active learning skews the training data, but the purpose
is to save effort in manually labeling the training data
10
Steps in the Process: Create Model Set
Data sets relevant to Data Mining
○ Training set: used to build initial model
○ Validation set: used to either tune model (e.g.,
pruning) or select amongst multiple models
○ Test set: used to evaluate goodness of model
For predictive tasks, must have class labels
○ Score set: data that the model is ultimately built for
For predictive tasks, class labels are not available
Note that training, validation and test data come
from labeled data
Cross validation can maximize the use of labeled data
○ 10-fold cross validation uses 90% for training and
10% for testing in each run. It entails 10 runs.
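The 10-fold scheme can be sketched as an index generator. This is a minimal illustration, assuming the examples have already been shuffled; the function name `k_fold_indices` is hypothetical and does not come from any particular tool.

```python
def k_fold_indices(n_examples, k=10):
    """Yield (train_indices, test_indices) for each of k folds."""
    indices = list(range(n_examples))
    for fold in range(k):
        test_idx = indices[fold::k]   # every k-th example, offset by fold
        held_out = set(test_idx)
        train_idx = [i for i in indices if i not in held_out]
        yield train_idx, test_idx

folds = list(k_fold_indices(100, k=10))
for run, (train_idx, test_idx) in enumerate(folds, start=1):
    print(f"run {run}: {len(train_idx)} training, {len(test_idx)} testing")
```

Every labeled example is used for testing exactly once and for training in the other nine runs, so all of the labeled data contributes to both building and evaluating models.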
11
Steps in the Process: Fix Data
Many data mining methods don’t need as
much variable “fixing” as statistical methods
Types of fixing
Missing values: many ways to fix
Too many categorical values: reduce
○ Binning, etc.
Numerical values skewed
○ Take log, etc.
Data preprocessing (Fayyad) may just alter
the representation
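The three kinds of fixing listed above can each be sketched in a few lines. The specific choices here are assumptions, since the slides only say there are "many ways": missing values are filled with the mean, rare categories are merged into a single "OTHER" value, and skewed numbers are compressed with log(1 + x). The sample data is invented.

```python
import math
from collections import Counter

def impute_mean(values):
    """Replace None with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

def group_rare_categories(values, min_count=2, other="OTHER"):
    """Reduce the number of categorical values by merging rare ones."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other for v in values]

def log_transform(values):
    """Compress a right-skewed, non-negative variable with log(1 + x)."""
    return [math.log1p(v) for v in values]

# Hypothetical raw columns.
incomes = [20_000, 35_000, None, 50_000, 1_000_000]
cities = ["NYC", "NYC", "Boston", "Albany", "NYC"]

print(impute_mean([1.0, None, 3.0]))
print(group_rare_categories(cities))
print([round(v, 2) for v in log_transform(impute_mean(incomes))])
```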
12
Steps in the Process: Transform
Aggregate data to a higher level
Time series data often must be converted into
examples for classification algorithms
○ Phone call data aggregated from call level to
describe activity associated with a phone #/user
Construction of new features is part of this
step. Feature construction can be critical.
Area of plot more useful for estimating value of
home than length and width.
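Both ideas on this slide, aggregating to a higher level and constructing a new feature, can be sketched with plain dictionaries. The call records, phone numbers, and field names below are invented for illustration.

```python
from collections import defaultdict

# Hypothetical call-level records: (phone_number, duration_in_minutes).
calls = [
    ("555-0100", 2.0),
    ("555-0100", 10.0),
    ("555-0199", 1.0),
    ("555-0100", 3.0),
    ("555-0199", 5.0),
]

def aggregate_calls(call_records):
    """Roll call-level rows up to one example per phone number."""
    durations = defaultdict(list)
    for number, minutes in call_records:
        durations[number].append(minutes)
    return {
        number: {
            "n_calls": len(mins),
            "total_minutes": sum(mins),
            "mean_minutes": sum(mins) / len(mins),
        }
        for number, mins in durations.items()
    }

def add_area(plot):
    """Construct a new feature (area) from two existing ones."""
    return {**plot, "area": plot["length"] * plot["width"]}

per_user = aggregate_calls(calls)
print(per_user["555-0100"])
print(add_area({"length": 30.0, "width": 20.0}))
```

After aggregation each phone number is a single example that a classification algorithm can consume, and the constructed `area` feature carries in one variable what the model would otherwise have to learn as an interaction between length and width.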
13
Steps in the Process: Assess Model
Predictive models are assessed based on the
correctness of their predictions
Accuracy is the simplest measure, but often not very
useful since not all errors are equal
○ we will learn more about this later
○ Lift curves are discussed in B&L (p 81)
Lift ratio = P(class|sample) / P(class|population)
Lift only makes sense when we can be selective, like in direct
marketing where we don’t have to judge every response
Descriptive models can be hard to evaluate since
there may not be objective criteria
How do you tell if a clustering is meaningful?
○ More on assessment methods later
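The lift ratio defined on this slide can be computed directly from model scores. The sketch below assumes the "sample" is the top-scoring fraction of the population, as in a direct marketing campaign; the scores and responses are invented.

```python
def lift(scored, fraction):
    """Lift of the top-scoring `fraction` of examples.

    `scored` is a list of (model_score, is_positive) pairs.
    """
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    n_selected = max(1, round(len(ranked) * fraction))
    sample = ranked[:n_selected]
    p_class_sample = sum(y for _, y in sample) / len(sample)
    p_class_population = sum(y for _, y in ranked) / len(ranked)
    return p_class_sample / p_class_population

# Hypothetical scores for 10 prospects; 1 = responded, 0 = did not.
scored = [(0.9, 1), (0.8, 1), (0.7, 0), (0.6, 1), (0.5, 0),
          (0.4, 0), (0.3, 0), (0.2, 1), (0.1, 0), (0.05, 0)]

print(lift(scored, 0.2))   # top 20%: 2 of 2 respond vs 4 of 10 overall
print(lift(scored, 1.0))   # selecting everyone gives no lift
```

When the whole population is selected the sample and the population are the same, so the ratio is 1, which is why lift is only informative when we can be selective.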
14
Steps in the Process: Deploy
Research models are fine: we run them off-line,
whenever we want to
In a business, must deal with real-world issues
○ In the WISDM project, we want to classify activities
in real time. This is also needed for many fraud
detection models. Must be able to execute the
model and do it quickly, possibly on different
hardware.
Some tools allow you to export the model as code
○ Even in off-line evaluation, may need to handle
huge amounts of data
15
Steps in the Process: Assess Results
True assessment is not just of model,
but includes the business context
Takes into account all costs and benefits
This may include costs that are very hard to
quantify
○ How much does a false negative medical test
cost if it causes the patient to die of a
preventable disease?
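The difference between assessing a model and assessing results can be made concrete by attaching a cost to each kind of error. The costs below (10 per false positive, 1000 per false negative) and the two sets of predictions are purely hypothetical; as the slide notes, real costs may be very hard to quantify.

```python
def accuracy(actual, predicted):
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)

def total_cost(actual, predicted, cost_fp, cost_fn):
    """Sum the cost of false positives and false negatives."""
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    return fp * cost_fp + fn * cost_fn

# 1 = has the disease, 0 = healthy (invented data).
actual  = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
model_a = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]   # misses one sick patient
model_b = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]   # one false alarm

for name, predicted in (("A", model_a), ("B", model_b)):
    print(name,
          "accuracy:", accuracy(actual, predicted),
          "cost:", total_cost(actual, predicted, cost_fp=10, cost_fn=1000))
```

Both models make exactly one error and so have the same accuracy, but under these assumed costs their results differ by a factor of 100.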
16
Steps in the Process: Iterate
Data Mining is an iterative process
Iteration can occur between most of the steps
Example: You don’t like overall results so you
add another feature. You then assess its impact
to see if you should keep it.
Example: You realize that assessment of your
model does not make sense and is missing
some costs, so you then incorporate these costs
into the model
17