Management Information Systems
Data Mining
Dr. Mohsen Kahani
http://www.um.ac.ir/~kahani/
Motivation:
“Necessity is the Mother of Invention”
Data explosion problem:
Automated data collection tools and mature database
technology lead to tremendous amounts of data stored
in databases, data warehouses and other information
repositories
We are drowning in data, but starving for knowledge!
Dr. Kahani - Expert Systems and Knowledge Engineering
Related Fields
[Figure: Data Mining and Knowledge Discovery shown together with its related fields: Machine Learning, Visualization, Statistics, Databases]
Knowledge Discovery Process
[Figure: diagram of the knowledge discovery process. Data stages shown: Raw Data, Data Warehouse, Target Data, Transformed Data, Patterns and Rules, Knowledge. Step labels shown: Understanding, Integration, Interpretation & Evaluation]
Data Mining and Business Intelligence
Layers, from top to bottom (the potential to support business decisions increases toward the top):
Making Decisions (End User)
Data Presentation: Visualization Techniques (Business Analyst)
Data Mining: Information Discovery (Data Analyst)
Data Exploration: Statistical Analysis, Querying and Reporting
Data Warehouses / Data Marts: OLAP, MDA
Data Sources: Paper, Files, Information Providers, Database Systems, OLTP (DBA)
Definition of Data Mining
“…The non-trivial process of identifying
valid, novel, potentially useful, and
ultimately understandable patterns in
data…”
Fayyad, Piatetsky-Shapiro, Smyth [1996]
Need for Data Mining
Data accumulate and double every 9 months
There is a big gap between stored data and knowledge, and the
transition will not happen automatically.
Manual data analysis is not new, but it has become a bottleneck
Fast-developing computer science and engineering generate
new demands
Seeking knowledge from massive data
When Is Data Mining Useful?
Data rich world
Large data (dimensionality and size)
Image data (size)
Gene chip data (dimensionality)
Little knowledge about data (exploratory data
analysis)
What if we have some knowledge?
Challenges
Increasing data dimensionality and data size
Various data forms
New data types
Streaming data, multimedia data
Efficient search and access to data/knowledge
Intelligent update and integration
Privacy Concerns
Results of Data Mining Include:
Forecasting what may happen in the future
Classifying people or things into groups by
recognizing patterns
Clustering people or things into groups based on
their attributes
Associating what events are likely to occur together
Sequencing what events are likely to lead to later
events
Data Mining versus OLAP
OLAP (On-line Analytical Processing)
Provides a very good view of what is happening, but cannot
predict what will happen in the future or explain why it is
happening
Data Mining Versus Statistical Analysis
Data Mining
- Originally developed to act as expert systems to solve problems
- Less interested in the mechanics of the technique
- If it makes sense then let’s use it
- Does not require assumptions to be made about data
- Can find patterns in very large amounts of data
- Requires understanding of data and business problem
Data Analysis
- Tests for statistical correctness of models
  Are statistical assumptions of models correct? E.g. is the R-square good?
- Hypothesis testing
  Is the relationship significant? Use a t-test to validate significance
- Tends to rely on sampling
- Techniques are not optimised for large amounts of data
- Requires strong statistical skills
Data Mining Taxonomy
Predictive Method
- …predict the value of a particular attribute…
Descriptive Method
- …foundation of human-interpretable patterns that
describe the data…
Data Mining Tasks...
Classification [Predictive]
Clustering [Descriptive]
Association Rule Discovery [Descriptive]
Sequential Pattern Discovery [Descriptive]
Deviation Detection [Predictive]
Data Mining Tasks:
Classification
Learn a method for predicting the instance class from
pre-labeled (classified) instances
Many approaches:
Statistics,
Decision Trees,
Neural Networks,
...
Classification: Linear Regression
Linear Regression
w0 + w1 x + w2 y >= 0
Regression computes wi from data to minimize squared error to ‘fit’ the data
Not flexible enough
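A minimal sketch of the idea on this slide, assuming six made-up 2-D points with targets +1 (blue) and -1 (green). The weights w0, w1, w2 are fitted by least squares through the normal equations, and a point is called blue when w0 + w1 x + w2 y >= 0. The helper names and the data are hypothetical, chosen only for illustration.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_linear(points, targets):
    """Least-squares weights (w0, w1, w2) for target ~ w0 + w1*x + w2*y."""
    rows = [[1.0, x, y] for x, y in points]
    # Normal equations: (R^T R) w = R^T t
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    atb = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(3)]
    return solve(ata, atb)


def classify(w, x, y):
    """Decision rule from the slide: blue when w0 + w1*x + w2*y >= 0."""
    return "blue" if w[0] + w[1] * x + w[2] * y >= 0 else "green"


# Hypothetical training data: target +1 means blue, -1 means green
points = [(1, 1), (2, 1), (1, 2), (5, 5), (6, 5), (5, 6)]
targets = [-1, -1, -1, 1, 1, 1]

w = fit_linear(points, targets)
predictions = [classify(w, x, y) for x, y in points]
```

The decision boundary is a single straight line, which is why the slide calls the approach not flexible enough for classes that are not linearly separable.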
Classification: Decision Trees
if X > 5 then blue
else if Y > 3 then blue
else if X > 2 then green
else blue
[Figure: X-Y plot of the resulting regions, with split lines at X = 2, X = 5 and Y = 3]
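The rules on this slide translate directly into a small function. This is only a sketch; the function name is an assumption, while the thresholds and colors are taken from the slide.

```python
def classify(x, y):
    """Apply the slide's rules in order; the first matching rule wins."""
    if x > 5:
        return "blue"
    if y > 3:
        return "blue"
    if x > 2:
        return "green"
    return "blue"


# A few sample points, one per rule
samples = [(6, 0), (1, 4), (3, 1), (1, 1)]
labels = [classify(x, y) for x, y in samples]
```

Each rule cuts the plane with an axis-parallel line, so the green region is the rectangle 2 < X <= 5, Y <= 3.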
Example Decision Tree
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
Splitting Attributes
Refund = Yes: NO
Refund = No: MarSt
  MarSt = Married: NO
  MarSt = Single, Divorced: TaxInc
    TaxInc < 80K: NO
    TaxInc > 80K: YES
The splitting attribute at a node is
determined based on the Gini index.
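As a rough illustration of the Gini index, the sketch below computes it for the ten training records on this slide, once for the whole set and once for a split on Refund. The function names are assumptions made for this example, and the code only scores a split; it does not build the tree.

```python
from collections import Counter

# The ten training records: (refund, marital status, taxable income in K, cheat)
records = [
    ("Yes", "Single", 125, "No"),
    ("No", "Married", 100, "No"),
    ("No", "Single", 70, "No"),
    ("Yes", "Married", 120, "No"),
    ("No", "Divorced", 95, "Yes"),
    ("No", "Married", 60, "No"),
    ("Yes", "Divorced", 220, "No"),
    ("No", "Single", 85, "Yes"),
    ("No", "Married", 75, "No"),
    ("No", "Single", 90, "Yes"),
]


def gini(labels):
    """Gini index of a list of class labels: 1 - sum(p_i^2)."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def split_gini(rows, column):
    """Weighted Gini index after partitioning rows by the value in `column`."""
    groups = {}
    for row in rows:
        groups.setdefault(row[column], []).append(row[-1])
    return sum(len(g) / len(rows) * gini(g) for g in groups.values())


root_gini = gini([row[-1] for row in records])  # 3 Yes, 7 No: about 0.42
refund_gini = split_gini(records, 0)            # about 0.343
```

A lower weighted value after the split means purer child nodes; a tree learner compares candidate attributes on this score.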
Classification: Neural Networks
- efficiently model large and complex problems;
- may be used in classification problems or for
regressions;
- Starts with input layer => hidden layer => output layer
[Figure: network diagram with nodes numbered 1 to 6, arranged as Inputs, Hidden Layer, Output]
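The input => hidden => output flow can be sketched as a forward pass. The layer sizes (3 inputs, 2 hidden units, 1 output), the sigmoid activation and all weight values below are assumptions made for illustration; training the weights is not shown.

```python
import math


def sigmoid(z):
    """Logistic activation, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def layer(inputs, weights, biases):
    """One fully connected layer: each unit computes sigmoid(w . x + b)."""
    return [
        sigmoid(sum(w * x for w, x in zip(unit_weights, inputs)) + b)
        for unit_weights, b in zip(weights, biases)
    ]


def forward(inputs, hidden_w, hidden_b, output_w, output_b):
    """Propagate inputs through the hidden layer, then the output layer."""
    hidden = layer(inputs, hidden_w, hidden_b)
    return layer(hidden, output_w, output_b)


# Hypothetical weights: 3 inputs -> 2 hidden units -> 1 output
hidden_w = [[0.5, -0.2, 0.1], [-0.3, 0.8, 0.4]]
hidden_b = [0.0, 0.1]
output_w = [[1.0, -1.0]]
output_b = [0.0]

output = forward([1.0, 0.0, 1.0], hidden_w, hidden_b, output_w, output_b)
```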
Neural Networks (cont.)
- can be easily implemented to run on massively
parallel computers;
- cannot be easily interpreted;
- require an extensive amount of training time;
- require a lot of data preparation (involve very careful
data cleansing, selection, preparation, and preprocessing);
- require a sufficiently large data set and a high signal-to-noise
ratio.
Classification Example
Training Set
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
Test Set
Refund  Marital Status  Taxable Income  Cheat
No      Single          75K             ?
Yes     Married         50K             ?
No      Married         150K            ?
Yes     Divorced        90K             ?
No      Single          40K             ?
No      Married         80K             ?
[Figure: Training Set => Learn Classifier => Model; the Model is then applied to the Test Set]
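To make the train/test flow concrete, the sketch below takes the tree from the Example Decision Tree slide as the learned model, checks it against the training set, and applies it to the six unlabeled test records. The function name is hypothetical, and treating an income of exactly 80K as "No" is an assumption, since the slide only gives < 80K and > 80K.

```python
def predict_cheat(refund, marital_status, income_k):
    """Walk the slide's tree: Refund, then marital status, then taxable income."""
    if refund == "Yes":
        return "No"
    if marital_status == "Married":
        return "No"
    # Single or Divorced: split at 80K (exactly 80K is treated as "No" here,
    # which is an assumption; the slide leaves that case open).
    return "Yes" if income_k > 80 else "No"


# (refund, marital status, taxable income in K, cheat)
training_set = [
    ("Yes", "Single", 125, "No"),
    ("No", "Married", 100, "No"),
    ("No", "Single", 70, "No"),
    ("Yes", "Married", 120, "No"),
    ("No", "Divorced", 95, "Yes"),
    ("No", "Married", 60, "No"),
    ("Yes", "Divorced", 220, "No"),
    ("No", "Single", 85, "Yes"),
    ("No", "Married", 75, "No"),
    ("No", "Single", 90, "Yes"),
]

# (refund, marital status, taxable income in K); the class is unknown
test_set = [
    ("No", "Single", 75),
    ("Yes", "Married", 50),
    ("No", "Married", 150),
    ("Yes", "Divorced", 90),
    ("No", "Single", 40),
    ("No", "Married", 80),
]

correct = sum(predict_cheat(r, m, i) == label for r, m, i, label in training_set)
training_accuracy = correct / len(training_set)
predictions = [predict_cheat(r, m, i) for r, m, i in test_set]
```

With this tree, all ten training records are classified correctly and every test record is predicted as "No". Accuracy on the training set says little about new data, which is why a separate test set is used.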
Classification Application
Direct Marketing
Fraud Detection
Customer Attrition/Churn
Sky Survey Cataloging
Data Mining Tasks:
Clustering
Goal is to identify categories
Natural grouping of customers
by processing all the available
data about them.
Other applications
market segmentation, discovering
affinity groups, and defect
analysis
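One common way to find such natural groupings is k-means, sketched below. It is used here as a stand-in and is not a method named on the slides. The customer data (age, yearly spend in thousands) and the choice of the first k points as initial centroids are assumptions made for illustration.

```python
def kmeans(points, k, iterations=20):
    """Plain k-means with a naive deterministic start (the first k points)."""
    centroids = [points[i] for i in range(k)]
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        centroids = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical customers: (age, yearly spend in thousands)
customers = [(25, 2), (60, 9), (27, 3), (62, 8), (24, 2.5), (58, 9.5)]
centroids, clusters = kmeans(customers, k=2)
```

No labels are supplied: the two groups (younger low spenders, older high spenders) emerge from the attribute values alone, which is what separates clustering from classification.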
Kohonen Network
Description
unsupervised
seeks to
describe dataset
in terms of
natural clusters
of cases
Data Mining Tasks:
Association Rule Discovery
Given a set of records, each of which contains some
number of items from a given collection;
Produce dependency rules which will predict occurrence
of an item based on occurrences of other items.
TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk
Rules Discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}
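The two rules can be checked against the five transactions with the usual support and confidence measures. The sketch below only evaluates given rules; it does not search for them, and the function names are assumptions made for this example.

```python
# The five transactions from the slide
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]


def count(itemset):
    """Number of transactions that contain every item in itemset."""
    return sum(1 for t in transactions if itemset <= t)


def support(antecedent, consequent):
    """Fraction of all transactions containing both sides of the rule."""
    return count(antecedent | consequent) / len(transactions)


def confidence(antecedent, consequent):
    """Of the transactions containing the antecedent, the fraction that also contain the consequent."""
    return count(antecedent | consequent) / count(antecedent)


milk_coke = (support({"Milk"}, {"Coke"}), confidence({"Milk"}, {"Coke"}))
diaper_milk_beer = (
    support({"Diaper", "Milk"}, {"Beer"}),
    confidence({"Diaper", "Milk"}, {"Beer"}),
)
```

On this table, {Milk} --> {Coke} has support 3/5 and confidence 3/4, and {Diaper, Milk} --> {Beer} has support 2/5 and confidence 2/3.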
Association Rule Discovery Application
Marketing and Sales Promotion
Supermarket Shelf Management
Inventory Management
Deviation Detection & Pattern Discovery
Deviation Detection:
…discovering most significant changes in data from
previously measured or normative values…
V. Kumar, M. Joshi, Tutorial on High Performance Data Mining.
Sequential Pattern Discovery:
…process of looking for patterns and rules that predict
strong sequential dependencies among different
events…
V. Kumar, M. Joshi, Tutorial on High Performance Data Mining.
Sequential Patterns
Identify frequently occurring sequences from given
records
40 percent of female customers buy a gray skirt six
months after buying a red jacket
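A statement like the one above can be measured from time-stamped purchase histories. The sketch below counts, among customers who bought a first item, the share who bought a second item within a given number of months afterwards. The histories are made up for illustration, and the function names and the six-month window parameter are assumptions.

```python
def follows(history, first, second, max_gap_months):
    """True if `second` was bought after `first`, within max_gap_months."""
    first_months = [month for month, item in history if item == first]
    second_months = [month for month, item in history if item == second]
    return any(0 < s - f <= max_gap_months for f in first_months for s in second_months)


def sequence_confidence(histories, first, second, max_gap_months):
    """Share of buyers of `first` who later bought `second` within the window."""
    buyers = [h for h in histories if any(item == first for _, item in h)]
    if not buyers:
        return 0.0
    hits = sum(follows(h, first, second, max_gap_months) for h in buyers)
    return hits / len(buyers)


# Hypothetical purchase histories: one list of (month, item) per customer
histories = [
    [(1, "red jacket"), (7, "gray skirt")],   # skirt 6 months later: counts
    [(2, "red jacket"), (5, "gray skirt")],   # skirt 3 months later: counts
    [(3, "red jacket")],                      # no skirt
    [(1, "gray skirt"), (4, "red jacket")],   # skirt came first: does not count
    [(2, "red jacket"), (12, "gray skirt")],  # skirt 10 months later: outside the window
    [(6, "blue scarf")],                      # never bought the jacket: not a buyer
]

share = sequence_confidence(histories, "red jacket", "gray skirt", max_gap_months=6)
```

Unlike an association rule, the order and timing of the events matter here: the same two items bought in the reverse order do not support the pattern.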
Data Mining Methodology: SAS
Sample
Extract a portion of the dataset for data mining
Explore
Modify
Create, select and transform variables with the intention of building
a model
Model
Specify a relationship of variables that reliably predicts a desired
goal
Assess
Evaluate the practical value of the findings and the model resulting
from the data mining effort
Data Mining Methodology: CRISP-DM
Business understanding
Data understanding
Data preparation
Modeling
Evaluation
Deployment
CRISP-DM Phases
Phases and Tasks
(Each task is followed by its outputs.)
Business Understanding
  Determine Business Objectives: Background; Business Objectives; Business Success Criteria
  Situation Assessment: Inventory of Resources; Requirements, Assumptions, and Constraints; Risks and Contingencies; Terminology; Costs and Benefits
  Determine Data Mining Goal: Data Mining Goals; Data Mining Success Criteria
  Produce Project Plan: Project Plan; Initial Assessment of Tools and Techniques
Data Understanding
  Collect Initial Data: Initial Data Collection Report
  Describe Data: Data Description Report
  Explore Data: Data Exploration Report
  Verify Data Quality: Data Quality Report
Data Preparation
  Phase outputs: Data Set; Data Set Description
  Select Data: Rationale for Inclusion / Exclusion
  Clean Data: Data Cleaning Report
  Construct Data: Derived Attributes; Generated Records
  Integrate Data: Merged Data
  Format Data: Reformatted Data
Modeling
  Select Modeling Technique: Modeling Technique; Modeling Assumptions
  Generate Test Design: Test Design
  Build Model: Parameter Settings; Models; Model Description
  Assess Model: Model Assessment; Revised Parameter Settings
Evaluation
  Evaluate Results: Assessment of Data Mining Results w.r.t. Business Success Criteria; Approved Models
  Review Process: Review of Process
  Determine Next Steps: List of Possible Actions; Decision
Deployment
  Plan Deployment: Deployment Plan
  Plan Monitoring and Maintenance: Monitoring and Maintenance Plan
  Produce Final Report: Final Report; Final Presentation
  Review Project: Experience Documentation
Major Application Areas for Data Mining Solutions
Fraud/Non-Compliance
Anomaly detection
- Isolate the factors that lead to fraud, waste and abuse
- Target auditing and investigative efforts more effectively
Credit/Risk Scoring
Intrusion detection
Parts failure prediction
Recruiting/Attracting customers
Maximizing profitability (cross selling, identifying profitable customers)
Service Delivery and Customer Retention
- Build profiles of customers likely to use which services
Web Mining
Health Care
Controversial Issues
Data mining (or simple analysis) on people may produce
profiles that raise controversial issues of
Discrimination
Privacy
Security
Examples:
Should males between 18 and 35 from countries that produced
terrorists be singled out for search before flight?
Can people be denied a mortgage based on age, sex, race?
Women live longer. Should they pay less for life insurance?
Data Mining and Discrimination
Can discrimination be based on features like sex,
age, national origin?
In some areas (e.g. mortgages, employment), some
features cannot be used for decision making
In other areas, these features are needed to assess the
risk factors
E.g. people of African descent are more susceptible to
sickle cell anemia
Data Mining and Privacy
Can information collected for one purpose be used for mining
data for another purpose?
In Europe, generally no, without explicit consent
In US, generally yes
Companies routinely collect information about customers and
use it for marketing, etc.
People may be willing to give up some of their privacy in
exchange for some benefits
See Data Mining And Privacy Symposium,
www.kdnuggets.com/gpspubs/ieee-expert-9504-priv.html
Data Mining and Privacy
Data Mining looks for patterns, not people!
Technical solutions can limit privacy invasion
Replacing sensitive personal data with anonymous IDs
Giving randomized outputs
Multi-party computation – distributed data
…
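As one possible reading of the first item, the sketch below replaces a customer name with a keyed pseudonym before the records are mined. The key, the record layout and the function name are hypothetical. Keyed hashing is only one way to do this, and pseudonyms alone do not guarantee anonymity if the remaining attributes identify a person.

```python
import hashlib
import hmac


def pseudonymize(value, secret_key):
    """Map a sensitive identifier to a stable pseudonym using HMAC-SHA256."""
    digest = hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]  # shortened for readability; an arbitrary choice


SECRET_KEY = b"replace-with-a-real-secret"  # hypothetical key, kept away from analysts

# Hypothetical purchase records
records = [
    {"name": "Alice Example", "purchase": "Milk"},
    {"name": "Alice Example", "purchase": "Coke"},
    {"name": "Bob Example", "purchase": "Beer"},
]

anonymized = [
    {"customer_id": pseudonymize(r["name"], SECRET_KEY), "purchase": r["purchase"]}
    for r in records
]
```

The same person always maps to the same ID, so patterns across that person's records can still be mined without the name appearing in the data set.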
The Hype Curve for Data Mining and Knowledge Discovery
[Figure: expectations and performance plotted over time (1990, 1998, 2000, 2002), with phases labeled rising expectations, over-inflated expectations, disappointment, and growing acceptance and mainstreaming]
Final Remarks
Data mining can be used in any field that
needs to find patterns or relationships in its
data.