DSS Chapter 1


Decision Support and
Business Intelligence
Systems
(9th Ed., Prentice Hall)
Chapter 5:
Data Mining for Business
Intelligence
Learning Objectives
• Define data mining as an enabling technology for business intelligence
• Understand the objectives and benefits of business analytics and data mining
• Recognize the wide range of applications of data mining
• Understand the steps involved in data preprocessing for data mining
Copyright © 2011 Pearson Education, Inc. Publishing as Prentice Hall
Introduction
• Data is produced at a phenomenal rate
• Our ability to store data has grown
• Users expect more sophisticated information
• How? UNCOVER HIDDEN INFORMATION: DATA MINING
Examples: What is (not) Data Mining?
• What is not Data Mining?
  • Look up a phone number in a phone directory
  • Query a Web search engine for information about “Amazon”
• What is Data Mining?
  • Certain names are more prevalent in certain US locations (e.g. in Boston area,…)
  • Group together similar documents returned by a search engine according to their context (e.g. Amazon.com, …)
  • A customer with income between 10,000 and 20,000 and age between 20 and 25 who purchased milk and bread is likely to purchase diapers within 5 years.
  • The amount of fish sold to people who live in a certain area and have an income between 20,000 and 35,000 is increasing.
Data Mining
Data Mining: the process of extracting valid, previously unknown, comprehensible, and actionable information from large databases and using it to make crucial business decisions.
• Involves analysis of data and use of software techniques for finding hidden and unexpected patterns and relationships in sets of data.
• Potential result: higher-level meta information that may not be obvious when looking at raw data
• Similar terms
  • Exploratory data analysis
  • Data driven discovery
  • Inductive learning
Decisions in Data Mining
• Databases to be mined
  • Relational, transactional, object-oriented, object-relational, spatial, time-series, text, multi-media, heterogeneous, legacy, WWW, etc.
• Knowledge to be mined
  • Association, classification, clustering, etc.
• Techniques utilized
  • Database-oriented, data warehouse (OLAP), machine learning, statistics, visualization, neural network, etc.
• Applications adapted
  • Retail, telecommunication, banking, fraud analysis, DNA mining, stock market analysis, Web mining, Weblog analysis, etc.
DBMS and Data Mining
Aspect | DBMS | Data Mining
Task | Extraction of detailed and summary data | Knowledge discovery of hidden patterns and insights
Type of result | Information | Insight and Prediction
Method | Deduction (Ask the question, verify with data) | Induction (Build the model, apply it to new data, get the result)
Example question | Who purchased mutual funds in the last 3 years? | Who will buy a mutual fund in the next 6 months and why?
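The contrast between deduction and induction in the table above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the purchase records, the labelled history, the cut-off date, and both function names are hypothetical, and the "model" is deliberately reduced to a single income threshold.

```python
from datetime import date

# Hypothetical purchase history: (customer, product, purchase date).
purchases = [
    ("ann", "mutual fund", date(2010, 3, 1)),
    ("bob", "bond", date(2009, 7, 15)),
    ("cho", "mutual fund", date(2005, 1, 20)),
]

def bought_mutual_fund_since(records, cutoff):
    """DBMS-style deduction: ask a question and verify it against stored data."""
    return sorted({c for c, p, d in records if p == "mutual fund" and d >= cutoff})

# Hypothetical labelled history: (income in thousands, bought a fund later?).
history = [(20, False), (35, False), (50, True), (80, True)]

def learn_income_threshold(examples):
    """Data-mining-style induction: build a model (here one income cut-off)
    by choosing the threshold that classifies the most examples correctly."""
    best_cut, best_correct = None, -1
    for cut in sorted(income for income, _ in examples):
        correct = sum((income >= cut) == label for income, label in examples)
        if correct > best_correct:
            best_cut, best_correct = cut, correct
    return best_cut

recent_buyers = bought_mutual_fund_since(purchases, date(2008, 1, 1))
threshold = learn_income_threshold(history)
will_buy = 60 >= threshold  # apply the learned model to a new, unseen customer
```

The query only returns what is already stored, while the learned threshold can be applied to customers who have no purchase record yet.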
Data Mining Tasks
• Prediction Tasks
  • Use some variables to predict unknown or future values of other variables
• Description Tasks
  • Find human-interpretable patterns that describe the data.
• Common data mining tasks
  • Classification [Predictive]
    • Find all credit applicants who are poor credit risks.
  • Clustering [Descriptive]
    • Identify customers with similar buying habits.
  • Association Rule Discovery [Descriptive]
    • Find all items which are frequently purchased with milk.
  • Sequential Pattern Discovery [Descriptive]
A Taxonomy for Data Mining Tasks
Data Mining | Learning Method | Popular Algorithms
Prediction | Supervised | Classification and Regression Trees, ANN, SVM, Genetic Algorithms
  Classification | Supervised | Decision trees, ANN/MLP, SVM, Rough sets, Genetic Algorithms
  Regression | Supervised | Linear/Nonlinear Regression, Regression trees, ANN/MLP, SVM
Association | Unsupervised | Apriori, OneR, ZeroR, Eclat
  Link analysis | Unsupervised | Expectation Maximization, Apriori Algorithm, Graph-based Matching
  Sequence analysis | Unsupervised | Apriori Algorithm, FP-Growth technique
Clustering | Unsupervised | K-means, ANN/SOM
  Outlier analysis | Unsupervised | K-means, Expectation Maximization (EM)
Classification: Definition
• Given a collection of records (training set)
  • Each record contains a set of attributes, one of the attributes is the class.
• Find a model for class attribute as a function of the values of other attributes.
• Goal: previously unseen records should be assigned a class as accurately as possible.
  • A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with training set used to build the model and test set used to validate it.
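The division into training and test sets can be sketched as a simple hold-out evaluation. The records, the 30% test fraction, the seed, and the function names below are assumptions made for illustration; the "model" is only a majority-class baseline, standing in for whatever classifier is actually being built.

```python
import random

def holdout_split(records, test_fraction=0.3, seed=0):
    """Shuffle a copy of the records and return (training, test)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(round(len(shuffled) * test_fraction)))
    return shuffled[n_test:], shuffled[:n_test]

def train_majority_class(training):
    """Baseline model: always predict the most common class in the training set."""
    counts = {}
    for _, label in training:
        counts[label] = counts.get(label, 0) + 1
    majority = max(sorted(counts), key=counts.get)
    return lambda attributes: majority

def accuracy(model, test):
    """Fraction of test records whose predicted class matches the true class."""
    return sum(model(attributes) == label for attributes, label in test) / len(test)

# Hypothetical records: (attributes, class label).
records = [({"x": i}, "yes" if i % 3 == 0 else "no") for i in range(10)]

training, test = holdout_split(records)
model = train_majority_class(training)
test_accuracy = accuracy(model, test)
```

The key design point is that the test records play no part in building the model, so the measured accuracy estimates performance on previously unseen records.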
Classification Example

Training Set
Tid | Refund | Marital Status | Taxable Income | Cheat
1 | Yes | Single | 125K | No
2 | No | Married | 100K | No
3 | No | Single | 70K | No
4 | Yes | Married | 120K | No
5 | No | Divorced | 95K | Yes
6 | No | Married | 60K | No
7 | Yes | Divorced | 220K | No
8 | No | Single | 85K | Yes
9 | No | Married | 75K | No
10 | No | Single | 90K | Yes

Test Set
Refund | Marital Status | Taxable Income | Cheat
No | Single | 75K | ?
Yes | Married | 50K | ?
No | Married | 150K | ?
Yes | Divorced | 90K | ?
No | Single | 40K | ?
No | Married | 80K | ?

[Figure: the Training Set is used to learn a classifier, which produces a Model; the Model is then applied to the Test Set.]
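As a rough sketch of the "learn classifier, then apply the model" flow, the following Python builds a small decision tree from the ten training records and applies it to the six unlabelled records. The splitting criterion (Gini impurity), the tie-breaking order, and all function names are assumptions chosen for this illustration; the slides do not say which algorithm produced their model.

```python
# Training table from the slide: (refund, marital status, income in K, cheat).
TRAINING = [
    ("Yes", "Single", 125, "No"), ("No", "Married", 100, "No"),
    ("No", "Single", 70, "No"), ("Yes", "Married", 120, "No"),
    ("No", "Divorced", 95, "Yes"), ("No", "Married", 60, "No"),
    ("Yes", "Divorced", 220, "No"), ("No", "Single", 85, "Yes"),
    ("No", "Married", 75, "No"), ("No", "Single", 90, "Yes"),
]
# Unlabelled records from the slide's test set.
UNSEEN = [
    ("No", "Single", 75), ("Yes", "Married", 50), ("No", "Married", 150),
    ("Yes", "Divorced", 90), ("No", "Single", 40), ("No", "Married", 80),
]

def gini(rows):
    """Gini impurity of the class labels (last field) in rows."""
    counts = {}
    for row in rows:
        counts[row[-1]] = counts.get(row[-1], 0) + 1
    return 1.0 - sum((c / len(rows)) ** 2 for c in counts.values())

def passes(row, column, value):
    """Numeric test: row value >= threshold. Categorical test: equality."""
    if isinstance(value, (int, float)):
        return row[column] >= value
    return row[column] == value

def build_tree(rows):
    """Return a class label (leaf) or a (column, value, yes_subtree, no_subtree) node."""
    if gini(rows) == 0.0:
        return rows[0][-1]
    best = None
    for column in range(len(rows[0]) - 1):
        for value in sorted({row[column] for row in rows}):
            yes = [r for r in rows if passes(r, column, value)]
            no = [r for r in rows if not passes(r, column, value)]
            if not yes or not no:
                continue
            score = (len(yes) * gini(yes) + len(no) * gini(no)) / len(rows)
            if best is None or score < best[0]:
                best = (score, column, value, yes, no)
    if best is None:  # identical attributes but mixed labels: use the majority
        labels = [r[-1] for r in rows]
        return max(sorted(set(labels)), key=labels.count)
    _, column, value, yes, no = best
    return (column, value, build_tree(yes), build_tree(no))

def classify(tree, record):
    """Walk from the root to a leaf and return the leaf's class label."""
    while isinstance(tree, tuple):
        column, value, yes, no = tree
        tree = yes if passes(record, column, value) else no
    return tree

tree = build_tree(TRAINING)
predictions = [classify(tree, record) for record in UNSEEN]
```

Because the tree is grown until every leaf is pure, it reproduces all ten training labels; that says nothing about accuracy on new records, which is why a labelled test set is needed for validation.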
Classification: Application Example
• Direct Marketing
  • Goal: Reduce cost of mailing by targeting a set of consumers likely to buy a new cell-phone product.
  • Approach:
    • Use the data for a similar product introduced before.
    • We know which customers decided to buy and which decided otherwise. This {buy, don’t buy} decision forms the class attribute.
    • Collect various demographic, lifestyle, and company-interaction related information about all such customers.
      • Type of business, where they stay, how much they earn, etc.
    • Use this information as input attributes to learn a classifier model.
Clustering Definition
• Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that
  • Data points in one cluster are more similar to one another.
  • Data points in separate clusters are less similar to one another.
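One common way to find such clusters is k-means, which the taxonomy slide lists under clustering. The sketch below is a minimal version under stated assumptions: similarity is measured by squared Euclidean distance, the first k points serve as initial centroids (real implementations usually initialise more carefully), and the six two-attribute data points are hypothetical.

```python
def squared_distance(p, q):
    """Squared Euclidean distance, used here as the (dis)similarity measure."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean_point(points):
    """Component-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def k_means(points, k, iterations=20):
    """Alternate between assigning points to the nearest centroid and
    recomputing each centroid, until the centroids stop changing."""
    centroids = list(points[:k])  # assumption: first k points as initial centroids
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        new_centroids = [mean_point(c) if c else centroids[i]
                         for i, c in enumerate(clusters)]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

# Hypothetical data points with two numeric attributes each.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, clusters = k_means(points, k=2)
```

On this data the three low-valued points end up in one cluster and the three high-valued points in the other, matching the definition: points within a cluster are closer to each other than to points in the other cluster.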
Clustering: Application Example
• Market Segmentation:
  • Goal: subdivide a market into distinct subsets of customers where any subset may conceivably be selected as a market target to be reached with a distinct marketing mix.
  • Approach:
    • Collect different attributes of customers based on their geographical and lifestyle related information.
    • Find clusters of similar customers.
Association Rule: Application Example
• Supermarket shelf management.
  • Goal: To identify items that are bought together by sufficiently many customers.
  • Approach: Process the point-of-sale data collected with barcode scanners to find dependencies among items.
  • A classic rule:
    • If a customer buys diapers and milk, then he is very likely to buy beer.
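"Bought together by sufficiently many customers" and "very likely" are usually quantified with two standard measures, support and confidence. The sketch below computes both for the classic rule; the five baskets are hypothetical data invented for illustration, and the function names are not from the slides.

```python
# Hypothetical point-of-sale baskets, one set of items per transaction.
transactions = [
    {"diaper", "milk", "beer"},
    {"diaper", "milk", "beer", "bread"},
    {"diaper", "milk"},
    {"milk", "bread"},
    {"beer", "chips"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in itemset."""
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Among baskets containing the antecedent, the fraction that also
    contain the consequent. Assumes the antecedent occurs at least once."""
    return support(antecedent | consequent, baskets) / support(antecedent, baskets)

rule_support = support({"diaper", "milk", "beer"}, transactions)
rule_confidence = confidence({"diaper", "milk"}, {"beer"}, transactions)
```

In this toy data, diapers, milk, and beer appear together in 2 of 5 baskets (support 0.4), and 2 of the 3 baskets with diapers and milk also contain beer (confidence about 0.67). An algorithm such as Apriori searches for all rules whose support and confidence exceed chosen minimums.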
Data Preparation – A Critical DM Task
Real-world Data
• Data Consolidation
  · Collect data
  · Select data
  · Integrate data
• Data Cleaning
  · Impute missing values
  · Reduce noise in data
  · Eliminate inconsistencies
• Data Transformation
  · Normalize data
  · Discretize/aggregate data
  · Construct new attributes
• Data Reduction
  · Reduce number of variables
  · Reduce number of cases
  · Balance skewed data
Well-formed Data
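Three of the steps above (impute missing values, normalize data, discretize data) can be sketched on a single numeric column. The choices here are assumptions for illustration only: mean imputation, min-max scaling to the range 0 to 1, fixed cut points for the bands, and a hypothetical income column.

```python
def impute_mean(values):
    """Data cleaning: replace missing entries (None) with the mean of the known ones."""
    known = [v for v in values if v is not None]
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in values]

def min_max_normalize(values):
    """Data transformation: rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def discretize(values, cut_points, labels):
    """Data transformation: map each value to a labelled band.
    labels must have one more entry than cut_points."""
    return [labels[sum(v >= c for c in cut_points)] for v in values]

# Hypothetical income column (in thousands) with one missing value.
incomes = [20.0, None, 40.0, 60.0]

filled = impute_mean(incomes)
scaled = min_max_normalize(filled)
bands = discretize(filled, [30.0, 50.0], ["low", "medium", "high"])
```

Here the missing income is filled with 40.0, the scaled column becomes 0.0, 0.5, 0.5, 1.0, and the bands are low, medium, medium, high. Which imputation and scaling method is appropriate depends on the data and on the mining algorithm that follows.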
Examples of Data Mining applications:
• Retail / Marketing
  • Identifying buying patterns of customers.
  • Predicting response to mailing campaigns.
• Banking
  • Detecting patterns of credit card fraud
  • Identifying loyal customers.
• Medicine
  • Identifying successful medical therapies.
• Banking and Other Financial
  • Automate the loan application process
  • Detecting fraudulent transactions
  • Maximize customer value (cross-, up-selling)
Examples of Data Mining applications:
• Customer Relationship Management
  • Maximize return on marketing campaigns
  • Improve customer retention
  • Maximize customer value (cross-, up-selling)
  • Identify and treat most valued customers
• Manufacturing and Maintenance
  • Predict/prevent machinery failures
  • Identify anomalies in production systems to optimize the use of manufacturing capacity
  • Discover novel patterns to improve product quality
Data Mining Applications
• Brokerage and Securities Trading
  • Predict changes on certain bond prices
  • Forecast the direction of stock fluctuations
Data Mining Software
• Commercial
  • SPSS - PASW (formerly Clementine)
  • SAS - Enterprise Miner
  • IBM - Intelligent Miner
  • StatSoft – Statistical Data Miner
  • … many more
• Free and/or Open Source
  • Weka
  • RapidMiner…
[Figure: chart of data mining tools with two series, "Total (w/ others)" and "Alone"; axis ticks from 0 to 120. Tools shown: SPSS PASW Modeler (formerly Clementine), RapidMiner, SAS / SAS Enterprise Miner, Microsoft Excel, R, Your own code, Weka (now Pentaho), KXEN, MATLAB, Other commercial tools, KNIME, Microsoft SQL Server, Other free tools, Zementis, Oracle DM, Statsoft Statistica, Salford CART, Mars, other, Orange, Angoss, C4.5, C5.0, See5, Bayesia, Insightful Miner/S-Plus (now TIBCO), Megaputer, Viscovery, Clario Analytics, Thinkanalytics, Miner3D.]
Source: KDNuggets.com, May 2009