Introduction to Data mining


Data mining and the knowledge discovery process
Institute for Knowledge and Agent Technology
MICC, Universiteit Maastricht
Summer Course 2007
H.H.L.M. Donkers
Content
Opening / acquaintance
What is data mining
Data mining methodology
Course perspective
Course contents
Data - Information - Knowledge
Data: symbols
Information: data that are processed to be useful; provides answers to "who", "what", "where", and "when" questions
Knowledge: application of data and information; answers "how" questions
Understanding: appreciation of "why"
Wisdom: evaluated understanding.
(Russell Ackoff - http://www.outsights.com/systems/dikw/dikw.htm)
Data - Information - Knowledge -
[Figure from http://www.outsights.com/systems/dikw/dikw.htm]
What is Data Mining – Traditionally
“Data mining is the extraction of implicit,
previously unknown, and potentially useful
information from data.”
Witten & Frank (2000). Data Mining.
What is Data Mining – Traditionally
“The application of specific algorithms for
extracting patterns from data, it is a part of
knowledge discovery from databases”
Fayyad (1997). From data mining to knowledge
discovery in databases.
What is Data Mining – Traditionally
“Data mining is a process, not just a series of
statistical analyses.”
SAS Institute (2003). Finding the solution to
data mining.
What is Data Mining – Traditionally
Computer Science
• (Semi-)automated application of algorithms for pattern discovery
• Algorithms developed in the field of Artificial Intelligence (machine learning)
• Part of the process of knowledge discovery
Statistics
• Process of discovering patterns in data
• (Manual) application of a series of statistical techniques (among which machine learning)
• Incorporates
– Exploration
– Sampling
– Modeling
– Validation
Data mining = Statistics + Marketing
What is Data Mining – A Fusion
“An analytic process designed to explore data in
search of consistent patterns and/or
systematic relationships between variables,
and then to validate the findings by applying
the detected patterns to new subsets of data.
The ultimate goal is prediction.”
Statsoft (2003). Data Mining Techniques.
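The "explore, then validate on new subsets of data" idea in this definition can be illustrated with a minimal sketch. It assumes a hypothetical toy data set and a deliberately simple one-attribute rule learner; the names `fit_one_rule` and `accuracy` are made up for this illustration and do not come from Statsoft.

```python
import random
from collections import Counter, defaultdict

def fit_one_rule(rows, attribute, target):
    """Learn 'attribute value -> most frequent target value' from the rows."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[row[attribute]][row[target]] += 1
    return {value: c.most_common(1)[0][0] for value, c in counts.items()}

def accuracy(rule, rows, attribute, target, default):
    """Fraction of rows whose target value the rule predicts correctly."""
    hits = sum(rule.get(row[attribute], default) == row[target] for row in rows)
    return hits / len(rows)

# Hypothetical toy data: "rainy -> no, otherwise -> yes", with 10% label noise.
rng = random.Random(0)
data = []
for _ in range(200):
    outlook = rng.choice(["sunny", "overcast", "rainy"])
    play = "no" if outlook == "rainy" else "yes"
    if rng.random() < 0.1:
        play = "yes" if play == "no" else "no"
    data.append({"outlook": outlook, "play": play})

# Explore and model on one subset, validate on a subset the model has not seen.
rng.shuffle(data)
train, holdout = data[:140], data[140:]
rule = fit_one_rule(train, "outlook", "play")
train_acc = accuracy(rule, train, "outlook", "play", default="yes")
holdout_acc = accuracy(rule, holdout, "outlook", "play", default="yes")
print(rule)
print(f"training accuracy: {train_acc:.2f}, holdout accuracy: {holdout_acc:.2f}")
```

The holdout accuracy, not the training accuracy, is the figure that says something about prediction on new data.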
What is Data Mining – A Fusion
“An information extraction activity whose goal is
to discover hidden facts contained in
databases. Using a combination of machine
learning, statistical analysis, modeling
techniques and database technology, data
mining finds patterns and subtle relationships
in data and infers rules that allow the
prediction of future results.”
Rudjer Boskovic Institute (2001). DMS Tutorial.
Data Mining in this Course
We use the book by Witten & Frank
• Computer science (machine learning) approach
Emphasis on algorithms for pattern discovery and rule extraction
– What are the underlying models
– What are the properties of the algorithms
– When to use them (for which tasks)
– How to apply and tune them
– How to interpret and assess the results
Data Mining Process
These algorithms are only part of a process that computer scientists call Knowledge Discovery and the statisticians call Data Mining
The process starts with the recognition of a problem and ends with the control of a deployed solution
The whole process needs to be supported for a successful application
Methodologies for Data Mining
As Data Mining is coming of age, several methodologies have been developed, each with its own perspective. We will discuss three of them:
• Fayyad et al. (Computer science)
– E.g., WEKA
• SEMMA (SAS) (Statistics)
– SAS Enterprise Miner, R
• CRISP-DM (SPSS, OHRA, among others) (Business)
– SPSS Clementine
Fayyad’s KDD Methodology
[Figure: data → Selection → target data → Preprocessing & cleaning → processed data → Transformation & feature selection → transformed data → Data Mining → patterns → Interpretation / Evaluation → knowledge]
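The stages of this process can be sketched as a chain of functions, each taking the output of the previous stage. This is only an illustration under stated assumptions: the records, the attribute names and the income threshold of 3000 are hypothetical, and the "data mining" step is a plain co-occurrence count rather than a real learning algorithm.

```python
from collections import Counter

# Hypothetical raw records; None marks a missing value.
raw = [
    {"id": 1, "age": 23, "income": 1800, "buys": "no"},
    {"id": 2, "age": 47, "income": 5200, "buys": "yes"},
    {"id": 3, "age": None, "income": 4100, "buys": "yes"},
    {"id": 4, "age": 35, "income": 3900, "buys": "yes"},
    {"id": 5, "age": 29, "income": 2100, "buys": "no"},
]

def select(records):
    """Selection: keep only the attributes relevant to the question (target data)."""
    return [{k: r[k] for k in ("age", "income", "buys")} for r in records]

def preprocess(records):
    """Preprocessing & cleaning: drop records with missing values (processed data)."""
    return [r for r in records if all(v is not None for v in r.values())]

def transform(records):
    """Transformation & feature selection: keep income only, discretised (transformed data)."""
    return [
        {"income_level": "high" if r["income"] >= 3000 else "low", "buys": r["buys"]}
        for r in records
    ]

def mine(records):
    """Data mining: count how often each income level goes with each outcome (patterns)."""
    return Counter((r["income_level"], r["buys"]) for r in records)

def interpret(patterns):
    """Interpretation / evaluation: present the counts as readable rules (knowledge)."""
    return [
        f"income_level={level} -> buys={buys} ({n} cases)"
        for (level, buys), n in sorted(patterns.items())
    ]

knowledge = interpret(mine(transform(preprocess(select(raw)))))
for rule in knowledge:
    print(rule)
```

In a real project the arrows also run backwards: a disappointing evaluation usually sends you back to an earlier stage.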
SEMMA Methodology
Supported by the SAS Enterprise Miner environment
SAMPLE: Input data, Sampling, Data partition
EXPLORE: Distribution explorer, Multiplot, Insight, Association, Variable selection
MODIFY: Transform variable, Filter outliers, Clustering, SOM / Kohonen
MODEL: Regression, Tree, Neural Network, Ensemble
ASSESS: Assessment, Score, Report
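As a rough illustration of the first three SEMMA steps, the sketch below draws a sample, partitions it, summarises the distribution and filters outliers. It uses only the Python standard library, not SAS Enterprise Miner; the data are hypothetical and the 1.5 × IQR outlier rule is an assumption chosen for the example. The MODEL and ASSESS steps are left out.

```python
import random
import statistics

rng = random.Random(42)
# Hypothetical input data: 1000 measurements plus two gross outliers.
population = [rng.gauss(50, 5) for _ in range(1000)] + [500.0, -300.0]

# SAMPLE: draw a random sample, then partition it into training and validation parts.
sample = rng.sample(population, 300)
train, validation = sample[:200], sample[200:]

# EXPLORE: summarise the distribution of the training part.
median = statistics.median(train)
q1, _, q3 = statistics.quantiles(train, n=4)
iqr = q3 - q1

# MODIFY: filter outliers with the common 1.5 * IQR rule (an assumption for this sketch).
def keep(x):
    return q1 - 1.5 * iqr <= x <= q3 + 1.5 * iqr

train_clean = [x for x in train if keep(x)]
print(f"median={median:.1f}, kept {len(train_clean)} of {len(train)} training values")
```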
CRISP-DM Methodology
Developed by data-mining companies (SPSS, NCR, OHRA, DaimlerChrysler), funded by the European Commission
Tool-independent / industry-independent
Hierarchical process model
1 Generic phases
2 Generic tasks
3 Specific tasks
4 Task instances
Supported by the SPSS Clementine environment
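The four levels of the hierarchical process model can be pictured as a nested data structure. The sketch below is one possible representation, not part of CRISP-DM itself: the class names are invented for the example, and the specific task and task instance shown are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class SpecificTask:            # level 3
    name: str
    instances: List[str] = field(default_factory=list)  # level 4: task instances

@dataclass
class GenericTask:             # level 2
    name: str
    specific_tasks: List[SpecificTask] = field(default_factory=list)

@dataclass
class Phase:                   # level 1
    name: str
    generic_tasks: List[GenericTask] = field(default_factory=list)

prep = Phase("Data preparation", [
    GenericTask("Clean data", [
        SpecificTask(
            "Handle missing values",                        # hypothetical specific task
            ["Drop customer records without a birth date"],  # hypothetical task instance
        ),
    ]),
])

def walk(phase):
    """Yield (level, label) pairs from the top of the hierarchy down."""
    yield 1, phase.name
    for task in phase.generic_tasks:
        yield 2, task.name
        for spec in task.specific_tasks:
            yield 3, spec.name
            for inst in spec.instances:
                yield 4, inst

for level, label in walk(prep):
    print("  " * (level - 1) + f"{level} {label}")
```

The upper two levels are the same for every project; the lower two are filled in per situation and per project.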
CRISP-DM Methodology
Phases: Business understanding, Data understanding, Data Preparation, Modeling, Evaluation, Deployment
TASKS (Business understanding)
• Business objective
• Assess situation
• Data mining goals
• Project plan
CRISP-DM Methodology
TASKS (Data understanding)
• Collect data
• Describe data
• Explore data
• Verify data quality
CRISP-DM Methodology
TASKS (Data Preparation)
• Select data
• Clean data
• Construct data
• Integrate data
• Format data
CRISP-DM Methodology
TASKS (Modeling)
• Select modeling techniques
• Design the test
• Build model
• Assess model
CRISP-DM Methodology
TASKS (Evaluation)
• Evaluate results
• Review process
• Determine next steps
CRISP-DM Methodology
TASKS (Deployment)
• Plan deployment
• Plan monitoring and maintenance
• Final report
• Review project
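The phases and tasks from the preceding slides can be collected into a simple project checklist. The sketch below is an illustration only: the function `next_task` is a made-up helper, and walking the phases strictly in order is a simplification, since in practice CRISP-DM projects move back and forth between phases.

```python
# Phases and tasks as listed on the slides, in process order.
CRISP_DM = {
    "Business understanding": [
        "Business objective", "Assess situation", "Data mining goals", "Project plan",
    ],
    "Data understanding": [
        "Collect data", "Describe data", "Explore data", "Verify data quality",
    ],
    "Data preparation": [
        "Select data", "Clean data", "Construct data", "Integrate data", "Format data",
    ],
    "Modeling": [
        "Select modeling techniques", "Design the test", "Build model", "Assess model",
    ],
    "Evaluation": ["Evaluate results", "Review process", "Determine next steps"],
    "Deployment": [
        "Plan deployment", "Plan monitoring and maintenance", "Final report", "Review project",
    ],
}

def next_task(done):
    """Return the first (phase, task) not yet in `done`, or None when all are finished."""
    for phase, tasks in CRISP_DM.items():
        for task in tasks:
            if (phase, task) not in done:
                return phase, task
    return None

# Example: business understanding is finished and data collection has been done.
done = {("Business understanding", t) for t in CRISP_DM["Business understanding"]}
done.add(("Data understanding", "Collect data"))
print(next_task(done))
```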
A Comparison
[Figure: the three process models side by side: Fayyad’s KDD process (Selection to Interpretation / Evaluation), the CRISP-DM phases (Business understanding to Deployment), and the SEMMA steps (SAMPLE to ASSESS)]
A Small Poll (July 2002)
Which DM Methodology do you use?
[Bar chart; answer options: None, Other, My own, My organisation's, SEMMA, CRISP-DM; axis from 0 to 100]
Source: http://www.kdnuggets.com/polls/2002/methodology.htm
Poll repeated (2004)
Which DM Methodology do you use?
[Bar chart; answer options: None, Other, My own, My organisation's, SEMMA, CRISP-DM; axis from 0 to 80]
Source: http://www.kdnuggets.com/polls/2004/data_mining_methodology.htm
Course perspective and goal
The perspective is from computer science (machine learning): Fayyad’s approach
The emphasis is on techniques for the automated discovery of patterns in data and the automated extraction of rules (the Model step of SEMMA and the Modeling phase of CRISP-DM)
The goal is to get acquainted with these techniques, so you can use them in the methodology of your choice
Course contents
Data preparation (Tuesday)
• Selection, preprocessing, transformation
Techniques, algorithms and models
• Decision trees (Monday)
• Instance-based and Bayesian learning (Wednesday)
• Neural networks (Wednesday)
• Association rules (Thursday)
• Clustering (Thursday)
• Support Vector Machines (Friday)
Evaluation of learned models (Tuesday)
Course contents
For each technique you learn
• For which tasks it is suitable
– Classification, rules, prediction, …
– Restrictions on input data (numerical, symbolic, etc.)
• What algorithms are available
• What parameters should be tuned
• How to interpret the results
• How to evaluate the model