Issues in Data Mining
Infrastructure
Authors:
Nemanja Jovanovic
Valentina Milenkovic
Prof. Dr. Veljko Milutinovic
http://galeb.etf.bg.ac.yu/~vm
Page 1/71
Data Mining in a Nutshell
Uncovering the hidden knowledge
Huge NP-complete search space
Multidimensional interface
Page 2/71
A Problem …
You are a marketing manager
for a cellular phone company
Problem: Churn is too high
Turnover (after contract expires) is 40%
Customers receive a free phone (cost $125)
with contract
You pay a sales commission of $250 per contract
Giving a new telephone to everyone
whose contract is expiring
is very expensive (as well as wasteful)
Bringing back a customer after quitting
is both difficult and expensive
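To make the figures on this slide concrete, here is a minimal Python sketch that compares the cost of three policies. The 40% turnover, the $125 phone, and the $250 commission come from the slide. The customer count, the fraction of customers a model would flag, and the assumption that replacing a lost customer costs one phone plus one commission are illustrative only.

```python
# Figures taken from the slide
PHONE_COST = 125    # dollars per free phone
COMMISSION = 250    # dollars of sales commission per contract
CHURN_RATE = 0.40   # turnover after the contract expires


def blanket_offer_cost(customers):
    """Cost of giving a new phone to every customer whose contract expires."""
    return customers * PHONE_COST


def targeted_offer_cost(customers, flagged_fraction):
    """Cost of giving a phone only to the customers a model flags as likely churners."""
    return customers * flagged_fraction * PHONE_COST


def replacement_cost(customers):
    """Cost of replacing every churner with a new customer.

    Assumption: each replacement costs one phone plus one commission.
    """
    return customers * CHURN_RATE * (PHONE_COST + COMMISSION)


if __name__ == "__main__":
    customers = 1000  # hypothetical customer base
    print("Blanket offer:", blanket_offer_cost(customers))
    print("Targeted offer (50% flagged):", targeted_offer_cost(customers, 0.5))
    print("Replacing all churners:", replacement_cost(customers))
```

The sketch ignores the cost of wrong predictions; a real comparison would also account for churners the model misses.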
Page 3/71
… A Solution
Three months before a contract expires,
predict which customers will leave
If you want to keep a customer
that is predicted to churn,
offer them a new phone
The ones that are not predicted to churn
need no attention
If you don’t want to keep the customer, do nothing
How can you predict future behavior?
Tarot Cards?
Magic Ball?
Data Mining?
Page 4/71
Still Skeptical?
Page 5/71
The Definition
The automated extraction
of predictive information
from (large) databases
Automated
Extraction
Predictive
Databases
Page 6/71
History of Data Mining
Page 7/71
Repetition in Solar Activity
1613 – Galileo Galilei
1859 – Heinrich Schwabe
Page 8/71
The Return of the
Halley Comet
Edmund Halley (1656 - 1742)
239 BC
1531
1607
1682
1910
1986
2061 ???
Page 9/71
Data Mining is Not
Data warehousing
Ad-hoc query/reporting
Online Analytical Processing (OLAP)
Data visualization
Page 10/71
Data Mining is
Automated extraction
of predictive information
from various data sources
Powerful technology
with great potential to help users focus
on the most important information
stored in data warehouses
or streamed through communication lines
Page 11/71
Data Mining can
Answer questions
that were too time-consuming
to resolve in the past
Predict future trends and behaviors,
allowing us to make proactive,
knowledge-driven decisions
Page 12/71
Focus of this Presentation
Data Mining problem types
Data Mining models and algorithms
Efficient Data Mining
Available software
Page 13/71
Data Mining
Problem Types
Page 14/71
Data Mining Problem Types
6 types
Often a combination solves the problem
Page 15/71
Data Description and
Summarization
Aims at concise description
of data characteristics
Lower end of scale of problem types
Provides the user an overview
of the data structure
Typically a sub goal
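A concise description of a numeric attribute can be sketched in a few lines of Python. The function name summarize and the choice of statistics are assumptions made for illustration; the slide does not prescribe a particular set of summary measures.

```python
import statistics


def summarize(values):
    """Return a small overview of one numeric attribute."""
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
    }


if __name__ == "__main__":
    # Hypothetical attribute values
    print(summarize([1, 2, 3, 4, 10]))
```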
Page 16/71
Segmentation
Separates the data into
interesting and meaningful
subgroups or classes
Manual or (semi)automatic
A problem in itself
or just a step
in solving a problem
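One common automatic segmentation technique is k-means clustering. The slide does not name a technique, so the following one-dimensional Python sketch is only an illustration; the function name kmeans_1d, the starting centers, and the sample values are all hypothetical.

```python
def kmeans_1d(values, centers, iterations=10):
    """Group numeric values around the nearest of the given centers.

    Each iteration assigns every value to its nearest center and then moves
    each center to the mean of its group. Assumes iterations >= 1.
    """
    centers = list(centers)
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            groups[nearest].append(v)
        # An empty group keeps its previous center
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers, groups


if __name__ == "__main__":
    centers, groups = kmeans_1d([1, 2, 3, 10, 11, 12], centers=[0, 5])
    print(centers, groups)
```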
Page 17/71
Classification
Assumption: existence of objects
with characteristics that
belong to different classes
Building classification models
which assign correct class labels to previously unseen objects
Occurs in a wide range of applications
Segmentation can provide labels
or restrict data sets
Page 18/71
Concept Description
Understandable description
of concepts or classes
Close connection to both
segmentation and classification
Similarity and differences
to classification
Page 19/71
Prediction (Regression)
Finds the numerical value
of the target attribute
for unseen objects
Similar to classification - difference:
discrete becomes continuous
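As a minimal illustration of predicting a continuous target, the sketch below fits a straight line by ordinary least squares and applies it to an unseen value. Least squares with a single input attribute is an assumption chosen for brevity; the slide does not commit to a specific regression method, and the names fit_line and predict are hypothetical.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predict(model, x):
    """Estimate the numeric target for an unseen object."""
    slope, intercept = model
    return slope * x + intercept


if __name__ == "__main__":
    model = fit_line([1, 2, 3, 4], [3, 5, 7, 9])  # hypothetical training data
    print(predict(model, 5))
```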
Page 20/71
Dependency Analysis
Finding the model
that describes significant dependences
between data items or events
Prediction of value of a data item
Special case: associations
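For the special case of associations, two standard measures are the support of an item set and the confidence of a rule. The slide does not mention them by name, so this Python sketch is an illustrative assumption, and the shopping-basket data is invented for the example.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= set(t) for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Of the transactions containing the antecedent, the fraction that also contain the consequent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


if __name__ == "__main__":
    # Hypothetical transactions
    baskets = [
        {"bread", "milk"},
        {"bread", "butter"},
        {"bread", "milk", "butter"},
        {"milk"},
    ]
    print(support(baskets, {"bread", "milk"}))
    print(confidence(baskets, {"bread"}, {"milk"}))
```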
Page 21/71
Data Mining Models
Page 22/71
Neural Networks
Characterizes processed data
with single numeric value
Efficient modeling of
large and complex problems
Based on biological structures
Neurons
Network consists of neurons
grouped into layers
Page 23/71
Neuron Functionality
[Figure: a neuron with inputs I1, I2, I3, …, In, weights W1, W2, W3, …, Wn, function f, and a single Output]
Output = f (W1*I1, W2*I2, …, Wn*In)
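A single neuron can be sketched in Python as follows. The slide leaves the function f open; combining the weighted inputs by summation and passing the sum through a sigmoid is a common choice, used here purely as an assumption. The names neuron_output and sigmoid are hypothetical.

```python
import math


def sigmoid(x):
    """A common activation function; an assumption, not taken from the slide."""
    return 1.0 / (1.0 + math.exp(-x))


def neuron_output(inputs, weights, f=sigmoid):
    """Weight each input, sum the products, and apply the function f."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    return f(sum(w * i for w, i in zip(weights, inputs)))


if __name__ == "__main__":
    # Hypothetical inputs and weights
    print(neuron_output([1.0, 2.0], [0.5, -0.25]))
```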
Page 24/71
Training Neural Networks
Page 25/71
Neural Networks - Conclusion
Once trained, Neural Networks
can efficiently estimate value
of output variable for given input
Neurons and network topology
are essentials
Usually used for prediction
or regression problem types
Difficult to understand
Data pre-processing often required
Page 26/71
Decision Trees
A way of representing a series of rules
that lead to a class or value
Iterative splitting of data
into discrete groups
maximizing distance between them
at each split
Classification trees and regression trees
Univariate splits and multivariate splits
Unlimited growth and stopping rules
CHAID, CART, QUEST, C5.0
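A single univariate split can be sketched as below. The slide speaks of maximizing the distance between groups without naming a measure; the sketch uses Gini impurity, one widely used criterion, as an assumption. The function names and the balance data are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a group: 0.0 means all labels are the same."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(values, labels):
    """Find the threshold t for which the groups `value <= t` and `value > t`
    have the lowest size-weighted Gini impurity."""
    best_threshold, best_score = None, float("inf")
    n = len(values)
    for t in sorted(set(values))[:-1]:  # the largest value would leave one side empty
        left = [lab for v, lab in zip(values, labels) if v <= t]
        right = [lab for v, lab in zip(values, labels) if v > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best_score:
            best_threshold, best_score = t, score
    return best_threshold, best_score


if __name__ == "__main__":
    balance = [5, 8, 10, 12, 20, 30]                      # hypothetical attribute
    risk = ["bad", "bad", "bad", "good", "good", "good"]  # hypothetical class labels
    print(best_split(balance, risk))
```

A tree-growing algorithm would apply such a split repeatedly to each resulting group until a stopping rule is met.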
Page 27/71
Decision Trees
[Figure: decision tree with the splits Balance>10 / Balance<=10, Age<=32 / Age>32, and Married=NO / Married=YES]
Page 28/71
Decision Trees
Page 29/71
Rule Induction
Method of deriving a set of rules
to classify cases
Creates independent rules
that are unlikely to form a tree
Rules may not cover
all possible situations
Rules may sometimes
conflict in a prediction
Page 30/71
Rule Induction
If balance>100.000
then confidence=HIGH & weight=1.7
If balance>25.000 and
status=married
then confidence=HIGH & weight=2.3
If balance<40.000
then confidence=LOW & weight=1.9
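The three rules above can be applied in Python as sketched here. Two things are assumptions rather than part of the slide: the dot in the balances is read as a thousands separator, and conflicts are resolved by summing the weights of the rules that fire and taking the outcome with the larger total. A case that no rule covers yields None.

```python
# Each rule: (condition, predicted confidence, weight), copied from the slide
RULES = [
    (lambda c: c["balance"] > 100_000, "HIGH", 1.7),
    (lambda c: c["balance"] > 25_000 and c["status"] == "married", "HIGH", 2.3),
    (lambda c: c["balance"] < 40_000, "LOW", 1.9),
]


def classify(case):
    """Return the outcome with the highest total weight, or None if no rule fires.

    Summing weights is an assumed way of resolving conflicting rules.
    """
    votes = {}
    for condition, outcome, weight in RULES:
        if condition(case):
            votes[outcome] = votes.get(outcome, 0.0) + weight
    if not votes:
        return None
    return max(votes, key=votes.get)


if __name__ == "__main__":
    # Rules 2 and 3 both fire and disagree
    print(classify({"balance": 30_000, "status": "married"}))
    # No rule covers this case
    print(classify({"balance": 50_000, "status": "single"}))
```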
Page 31/71
K-nearest Neighbor and
Memory-Based Reasoning (MBR)
Usage of knowledge
of previously solved similar problems
in solving the new problem
Assigning the class to the group
where most of the k "neighbors" belong
First step – finding the suitable measure
for distance between attributes in the data
How far is black from green?
+ Easy handling of non-standard data types
- Huge models
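A minimal k-nearest-neighbor classifier might look like the sketch below. Euclidean distance is only a default assumption; for non-standard data types such as colors, a different distance function would be passed in. The stored cases illustrate why the models are huge: the whole training set is kept. All names and data points are hypothetical.

```python
import math
from collections import Counter


def knn_classify(cases, query, k=3, distance=math.dist):
    """Assign `query` the class that is most common among its k nearest stored cases.

    `cases` is a list of (attributes, class_label) pairs.
    """
    neighbors = sorted(cases, key=lambda case: distance(case[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    # Hypothetical previously solved cases
    cases = [
        ((1, 1), "a"), ((1, 2), "a"), ((2, 1), "a"),
        ((8, 8), "b"), ((9, 8), "b"),
    ]
    print(knn_classify(cases, (1.5, 1.5)))
    print(knn_classify(cases, (8.5, 8.0)))
```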
Page 32/71
K-nearest Neighbor and
Memory-Based Reasoning (MBR)
Page 33/71
Data Mining Models
and Algorithms
Many other available models and algorithms
Logistic regression
Discriminant analysis
Generalized Additive Models (GAM)
Genetic algorithms
Etc…
Many application specific variations
of known models
Final implementation usually involves
several techniques
Selecting the solution that gives the best results
Page 34/71
Efficient Data Mining
Page 35/71
[Figure: humorous troubleshooting flowchart with the boxes "Is It Working?", "Don’t Mess With It!", "Did You Mess With It?", "You Shouldn’t Have!", "Anyone Else Knows?", "You’re in TROUBLE!", "Hide It", "Can You Blame Someone Else?", "Will it Explode In Your Hands?", "Look The Other Way", and "NO PROBLEM!", joined by YES/NO branches]
Page 36/71
DM Process Model
5A – used by SPSS Clementine
(Assess, Access, Analyze, Act and Automate)
SEMMA – used by SAS Enterprise Miner
(Sample, Explore, Modify, Model and Assess)
CRISP–DM – on its way to becoming a standard
Page 37/71
CRISP - DM
CRoss-Industry Standard Process for Data Mining
Conceived in 1996 by three companies:
Page 38/71
CRISP – DM methodology
Four level breakdown of the CRISP-DM methodology:
Phases
Generic Tasks
Specialized Tasks
Process Instances
Page 39/71
Mapping generic models
to specialized models
Analyze the specific context
Remove any details not applicable to the context
Add any details specific to the context
Specialize generic context according to
concrete characteristic of the context
Possibly rename generic contents
to provide more explicit meanings
Page 40/71
Generalized and Specialized
Cooking
Generalized: preparing food on your own
Find out what you want to eat
Find the recipe for that meal
Gather the ingredients
Prepare the meal
Enjoy your food
Clean up everything (or leave it for later)
Specialized: the same steps for a concrete meal
Raw steak with vegetables?
Check the Cookbook or call mom
Defrost the meat (if you had it in the fridge)
Buy missing ingredients or borrow them from the neighbors
Cook the vegetables and fry the meat
Enjoy your food or even more
You were cooking, so convince someone else to do the dishes
Page 41/71
CRISP – DM model
Business understanding
Data understanding
Data preparation
Modeling
Evaluation
Deployment
Page 42/71
Customizing a Web Page
User-friendly design
Prediction of the users' interests
Reduction of server workload
Reduction of Web traffic
Page 43/71
Customizing a Web Page
Page 44/71
Business Understanding
Determine business objectives
Assess situation
Determine data mining goals
Produce project plan
Page 45/71
Business Understanding - Outputs
Background
Business objectives and success criteria
Inventory of resources
Requirements, assumptions, and constraints
Risks and contingencies
Terminology
Costs and benefits
Data mining goals and success criteria
Project plan
Initial assessment of tools and techniques
Page 46/71
Customizing a Web Page –
Business Understanding Example
Business objectives
Make the users' surfing more comfortable
Decrease of overhead for users
Reduction of workload and Web traffic
Assess situation
Data mining goals
Find the patterns in the user behavior
Project plan
Page 47/71
Data Understanding
Collect initial data
Describe data
Explore data
Verify data quality
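One simple data quality check is the share of missing values per field. The sketch below is an illustration only: the record layout, the field names, and the rule that None and the empty string count as missing are all assumptions.

```python
def missing_value_report(records, fields):
    """Return, for each field, the fraction of records in which it is missing.

    A value counts as missing when the field is absent, None, or an empty string.
    """
    report = {}
    for field in fields:
        missing = sum(1 for record in records if record.get(field) in (None, ""))
        report[field] = missing / len(records) if records else 0.0
    return report


if __name__ == "__main__":
    # Hypothetical collected records
    records = [
        {"user": "u1", "url": "/a"},
        {"user": "", "url": "/b"},
        {"user": "u3"},
        {"user": "u4", "url": None},
    ]
    print(missing_value_report(records, ["user", "url"]))
```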
Page 48/71
Data Understanding - Outputs
Data collection report
Background of data
List of data sources
For each data source, method of acquisition
Problems encountered in data acquisition
Data description report
Detailed description of each data source
List of tables or other database objects
Description of each field including units, codes, etc.
Data exploration report
Expected regularities or patterns and methods of detection
Regularities or patterns found (expected and unexpected)
Any other surprises
Conclusions for data transformation, data cleaning and
any other pre-processing
Conclusions related to data mining goals or
business objectives
Data quality report
Approach taken to assess data quality
Results of data quality assessment
Page 49/71
Customizing a Web Page –
Data Understanding Example
Collecting the data
Update the server to monitor user behavior
Record the users' activities into a storage
Data description
Results of data exploring
Analyze recorded data
Decide which data is usable for mining
Verification of the quality of the data
Page 50/71
Data Preparation
Select data
Clean data
Construct data
Integrate data
Format data
Page 51/71
Data Preparation - Outputs
Dataset description report
Background including broad goals and
plan for pre-processing
Description of pre-processing
Detailed description of resultant datasets
Rationale for inclusion/exclusion of attributes
Discoveries made during pre-processing
and implications for further work
Dataset
Page 52/71
Customizing a Web Page –
Data Preparation Example
Decide which period of the users'
monitored actions will be considered
Make assumptions about
unnecessary monitored data
and discard them
Classify user actions into categories,
group interesting links, etc…
If more information about the user is available
from other sources, use it
Transform data into suitable forms
so several modeling techniques
could be applied
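The preparation steps for the Web page example could be sketched as follows. Everything concrete here is hypothetical: the record fields (user, day, url), the file suffixes treated as unnecessary, and the mapping from URL prefixes to categories.

```python
from datetime import date

# Assumption: requests for these file types say nothing about user interests
SKIPPED_SUFFIXES = (".gif", ".jpg", ".css")


def prepare(records, start, end, categories):
    """Select records from the chosen period, discard unnecessary ones,
    and replace each URL by a category."""
    prepared = []
    for record in records:
        if not (start <= record["day"] <= end):
            continue  # outside the considered period
        if record["url"].endswith(SKIPPED_SUFFIXES):
            continue  # unnecessary monitored data
        category = "other"
        for prefix, name in categories.items():
            if record["url"].startswith(prefix):
                category = name
                break
        prepared.append({"user": record["user"], "category": category})
    return prepared


if __name__ == "__main__":
    log = [
        {"user": "u1", "day": date(2001, 3, 1), "url": "/sports/nba.html"},
        {"user": "u1", "day": date(2001, 3, 1), "url": "/img/logo.gif"},
        {"user": "u2", "day": date(2000, 1, 1), "url": "/news/today.html"},
        {"user": "u2", "day": date(2001, 3, 2), "url": "/about.html"},
    ]
    categories = {"/sports/": "sports", "/news/": "news"}
    print(prepare(log, date(2001, 1, 1), date(2001, 12, 31), categories))
```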
Page 53/71
Modeling
Select modeling technique
Generate test design
Build model
Assess model
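A simple test design holds back part of the data and measures how often the model is right on it. The holdout split and the accuracy measure below are one common choice, shown as an assumption; the slide does not prescribe them. The function names, the 25% test fraction, and the fixed seed are hypothetical.

```python
import random


def holdout_split(rows, test_fraction=0.25, seed=0):
    """Shuffle the rows reproducibly and return (training_rows, test_rows)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def accuracy(predicted, actual):
    """Fraction of test cases for which the predicted label equals the actual one."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


if __name__ == "__main__":
    train, test = holdout_split(range(20))
    print(len(train), len(test))
    print(accuracy(["a", "b", "a", "a"], ["a", "b", "b", "a"]))
```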
Page 54/71
Modeling - Outputs
Assessment of DM results with respect to
business success criteria
Test design
Broad description of the type of model and
the training data to be used
Explanation of how the model will be tested or assessed
Description of any data required for testing
Description of any planned examination of models
by domain or data experts
Model description
Type of model and relation to data mining goals
Parameter settings used to produce model
Detailed description of the model and
any special features
Conclusions regarding patterns in the data
Model assessment
Overview of assessment process including
deviations from the test plan
Detailed assessment of the model
Comments on models by domain or data experts
Insights into why a certain modeling technique and
certain parameter setting lead to good/bad results
Page 55/71
Customizing a Web Page –
Modeling Example
The problem is prediction of behavior
Regression could be a good solution
due to distinct nature of the data
Create the software
according to the project plan
Observe the behavior of the software
Tune the model after each evaluation phase
if needed
Page 56/71
Evaluation
results = models + findings
Evaluate results
Review process
Determine next steps
Page 57/71
Evaluation - Outputs
Assessment of DM results with respect to
business success criteria
Review of Business Objectives and
Business Success Criteria
Comparison between success criterion and DM results
Conclusion about achievability of success criterion
and suitability of data mining process
Review of “Project Success”
Are there new business objectives?
Review of process
List of possible actions
Page 58/71
Customizing a Web Page –
Evaluation Example
Observe the model behavior at work
Collect response from Beta testers
Check user satisfaction
Check server and network engagement
Classify results
Determine which parameter of the model
should be changed
Present new ideas and modifications
Step back into previous phases as needed
Page 59/71
Deployment
Plan deployment
Plan monitoring and maintenance
Produce final report
Review project
Page 60/71
Deployment - Outputs
Monitoring and maintenance plan
Overview of deployment results and indication
of which results may require updating
Description of how updating will be triggered
Description of how updating will be performed
Final report
Summary of Business Understanding
(background, objectives and success criteria)
Summary of data mining process
Summary of data mining results
Summary of results evaluation
Summary of deployment and maintenance plan
Cost/benefit analysis
Conclusions for the business
Conclusions for future data mining
Page 61/71
Customizing a Web Page –
Deployment Example
Make the feature available to all users
Make a plan for maintenance and user feedback
Analyze costs and benefits
Summarize the whole documentation
Summarize network and server
additional activity
Collect the new ideas
Award according to results
Leave space for upgrade
Page 62/71
At Last…
Page 63/71
Available Software
Page 64/71
Available Software
Discussion of
data mining vendors and software
is not included in this slide set
Page 65/71
Conclusions
Page 66/71
WWW.NBA.COM
Page 67/71
Se7en
Page 68/71
CD – ROM
Page 69/71
Credits
Anne Stern, SPSS, Inc.
Djuro Gluvajic, ITE, Denmark
Obrad Milivojevic, PC PRO, Yugoslavia
Page 70/71
References
Bruha, I., “Data Mining, KDD and Knowledge Integration:
Methodology and a Case Study”,
SSGRR 2000
Fayyad, U., Piatetsky-Shapiro, G., Smyth, P., Uthurusamy, R.,
“Advances in Knowledge Discovery and Data Mining”,
MIT Press, 1996
Glymour, C., Madigan, D., Pregibon, D., Smyth, P.,
“Statistical Themes and Lessons for Data Mining”,
Data Mining and Knowledge Discovery 1, 11-28, 1997
Hecht-Nielsen, R., “Neurocomputing”,
Addison-Wesley, 1990
Pyle, D., “Data Preparation for Data Mining”,
Morgan Kaufmann, 1999
galeb.etf.bg.ac.yu/~vm
www.thearling.com
www.crisp-dm.com
www.twocrows.com
www.sas.com/products/miner
www.spss.com/clementine
Page 71/71
The END
Page 72/71