Management Information Systems

Data Mining
Dr. Mohsen Kahani
http://www.um.ac.ir/~kahani/
Motivation:
“Necessity is the Mother of Invention”
- Data explosion problem:
  - Automated data collection tools and mature database technology lead to tremendous amounts of data stored in databases, data warehouses and other information repositories
- We are drowning in data, but starving for knowledge!
Expert Systems and Knowledge Engineering - Dr. Kahani
Related Fields
[Figure: Data Mining and Knowledge Discovery at the center, surrounded by Machine Learning, Visualization, Statistics, and Databases]
Knowledge Discovery Process
[Figure: Raw Data -> Integration -> Data Warehouse -> Target Data -> Transformed Data -> Patterns and Rules -> Interpretation & Evaluation -> Knowledge -> Understanding]
Data Mining and Business Intelligence
Layers, listed from the top (highest potential to support business decisions) to the bottom:
- Making Decisions
- Data Presentation: Visualization Techniques
- Data Mining: Information Discovery
- Data Exploration: Statistical Analysis, Querying and Reporting
- Data Warehouses / Data Marts: OLAP, MDA
- Data Sources: Paper, Files, Information Providers, Database Systems, OLTP
Roles, from top to bottom: End User, Business Analyst, Data Analyst, DBA
Definition of Data Mining
“…The non-trivial process of identifying
valid, novel, potentially useful, and
ultimately understandable patterns in
data…”
Fayyad, Piatetsky-Shapiro, Smyth [1996]
Need for Data Mining
- Data accumulates and doubles every 9 months
- There is a big gap between stored data and knowledge, and the transition won’t occur automatically.
- Manual data analysis is not new, but it is a bottleneck
- Fast-developing computer science and engineering generate new demands
- Seeking knowledge from massive data
When is DM useful?
- Data-rich world
- Large data (dimensionality and size)
  - Image data (size)
  - Gene chip data (dimensionality)
- Little knowledge about data (exploratory data analysis)
- What if we have some knowledge?
Challenges
- Increasing data dimensionality and data size
- Various data forms
- New data types
  - Streaming data, multimedia data
- Efficient search and access to data/knowledge
- Intelligent update and integration
- Privacy concerns
Results of Data Mining Include:
- Forecasting what may happen in the future
- Classifying people or things into groups by recognizing patterns
- Clustering people or things into groups based on their attributes
- Associating what events are likely to occur together
- Sequencing what events are likely to lead to later events
Data Mining versus OLAP
OLAP - On-line Analytical Processing
- Provides you with a very good view of what is happening, but cannot predict what will happen in the future or explain why it is happening
Data Mining Versus Statistical Analysis
Data Mining
- Originally developed to act as expert systems to solve problems
- Less interested in the mechanics of the technique
- If it makes sense then let’s use it
- Does not require assumptions to be made about data
- Can find patterns in very large amounts of data
- Requires understanding of data and business problem
Data Analysis
- Tests for statistical correctness of models
  - Are statistical assumptions of models correct?
  - E.g., is the R-squared good?
- Hypothesis testing
  - Is the relationship significant?
  - Use a t-test to validate significance
- Tends to rely on sampling
- Techniques are not optimised for large amounts of data
- Requires strong statistical skills
Data Mining Taxonomy
Predictive Method
- …predict the value of a particular attribute…
Descriptive Method
- …foundation of human-interpretable patterns that describe the data…
Data Mining Tasks...
- Classification [Predictive]
- Clustering [Descriptive]
- Association Rule Discovery [Descriptive]
- Sequential Pattern Discovery [Descriptive]
- Deviation Detection [Predictive]
Data Mining Tasks: Classification
Learn a method for predicting the instance class from pre-labeled (classified) instances
Many approaches: Statistics, Decision Trees, Neural Networks, ...
Classification: Linear Regression
- Linear Regression
  w0 + w1 x + w2 y >= 0
- Regression computes wi from data to minimize squared error to ‘fit’ the data
- Not flexible enough
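A minimal sketch of the idea on this slide, assuming six made-up 2-D points labeled -1 or +1 (the data and the helper names solve, fit and predict are illustrative, not from the lecture). The weights w0, w1, w2 are found by least squares through the normal equations, and a point is assigned to the positive class when w0 + w1 x + w2 y >= 0.

```python
def solve(a, b):
    """Solve a small linear system by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    w = [0.0] * n
    for r in reversed(range(n)):
        s = sum(m[r][c] * w[c] for c in range(r + 1, n))
        w[r] = (m[r][n] - s) / m[r][r]
    return w


def fit(points, labels):
    """Least-squares weights [w0, w1, w2] for targets of -1 or +1."""
    rows = [[1.0, x, y] for x, y in points]
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    atb = [sum(r[i] * t for r, t in zip(rows, labels)) for i in range(3)]
    return solve(ata, atb)


def predict(w, x, y):
    """Apply the decision rule w0 + w1*x + w2*y >= 0."""
    return 1 if w[0] + w[1] * x + w[2] * y >= 0 else -1


# Hypothetical training data: two well-separated groups.
points = [(1, 1), (2, 1), (1, 2), (4, 4), (5, 4), (4, 5)]
labels = [-1, -1, -1, 1, 1, 1]

w = fit(points, labels)
predictions = [predict(w, x, y) for x, y in points]
print(w, predictions)
```

Because the boundary is a single straight line, this approach cannot separate classes that need a curved or piecewise boundary, which is the limitation the last bullet points at.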
Classification: Decision Trees
if X > 5 then blue
else if Y > 3 then blue
else if X > 2 then green
else blue
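The rules above translate directly into nested conditionals. This is a small sketch; the function name classify_point is an illustrative choice, not part of the slides.

```python
def classify_point(x, y):
    """Return the class color for a point, following the slide's rules in order."""
    if x > 5:
        return "blue"
    elif y > 3:
        return "blue"
    elif x > 2:
        return "green"
    else:
        return "blue"


# A few sample points, one per branch of the rule list.
samples = [(6, 0), (3, 4), (3, 1), (1, 1)]
colors = [classify_point(x, y) for x, y in samples]
print(colors)
```

Each test compares one attribute against one threshold, so the resulting class regions are axis-aligned rectangles.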
[Figure: X-Y plot partitioned by the rules above, with thresholds at X = 2, X = 5 and Y = 3]
Example Decision Tree

Training data:
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Model (decision tree; Refund, MarSt and TaxInc are the splitting attributes):
Refund
- Yes: NO
- No: MarSt
  - Single, Divorced: TaxInc
    - < 80K: NO
    - > 80K: YES
  - Married: NO
The splitting attribute at a node is
determined based on the Gini index.
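As a rough illustration of the Gini index, the sketch below computes the impurity of the ten training records and the weighted impurity after a few candidate splits. The function names gini and gini_split are illustrative, and the candidate splits are chosen here for demonstration; the slides do not say which candidates the tree builder compared.

```python
from collections import Counter

# Training records from the slide: (refund, marital status, income in K, cheat).
records = [
    ("Yes", "Single", 125, "No"),
    ("No", "Married", 100, "No"),
    ("No", "Single", 70, "No"),
    ("Yes", "Married", 120, "No"),
    ("No", "Divorced", 95, "Yes"),
    ("No", "Married", 60, "No"),
    ("Yes", "Divorced", 220, "No"),
    ("No", "Single", 85, "Yes"),
    ("No", "Married", 75, "No"),
    ("No", "Single", 90, "Yes"),
]


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_split(rows, key):
    """Weighted Gini impurity of the groups produced by the key function."""
    groups = {}
    for row in rows:
        groups.setdefault(key(row), []).append(row[3])
    return sum(len(g) / len(rows) * gini(g) for g in groups.values())


gini_all = gini([r[3] for r in records])
gini_refund = gini_split(records, lambda r: r[0])
gini_marital = gini_split(records, lambda r: r[1])
gini_income = gini_split(records, lambda r: r[2] < 80)

print(round(gini_all, 3), round(gini_refund, 3),
      round(gini_marital, 3), round(gini_income, 3))
```

A lower weighted value means the split produces purer groups. Note that different candidate sets (for example binary versus multi-way splits) can lead to different trees that fit the same data.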
Classification: Neural Networks
- efficiently model large and complex problems;
- may be used in classification problems or for regressions;
- structured as input layer => hidden layer => output layer
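The layer structure can be sketched as a forward pass through a tiny network. The layout (three inputs, two hidden units, one output), the sigmoid activation, and all weights below are assumptions made for illustration; the slides do not specify them, and no training is shown.

```python
import math


def sigmoid(z):
    """Squash a weighted sum into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def layer(inputs, weights, biases):
    """One fully connected layer: a weighted sum per unit, then a sigmoid."""
    return [
        sigmoid(sum(w * x for w, x in zip(unit_weights, inputs)) + b)
        for unit_weights, b in zip(weights, biases)
    ]


# Hypothetical, hand-picked weights (a real network would learn these).
hidden_weights = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]]
hidden_biases = [0.0, 0.1]
output_weights = [[1.0, -1.0]]
output_biases = [0.0]


def forward(inputs):
    """Input layer => hidden layer => output layer."""
    hidden = layer(inputs, hidden_weights, hidden_biases)
    return layer(hidden, output_weights, output_biases)[0]


score = forward([1.0, 0.5, -1.0])
print(score)
```

For classification, the output score would typically be compared against a threshold such as 0.5.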
[Figure: feed-forward network with six numbered nodes arranged as Inputs, Hidden Layer, and Output]
Neural Networks (cont.)
- can be easily implemented to run on massively parallel computers;
- cannot be easily interpreted;
- require an extensive amount of training time;
- require a lot of data preparation (involving very careful data cleansing, selection, preparation, and preprocessing);
- require a sufficiently large data set and a high signal-to-noise ratio.
Classification Example

Training Set:
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Test Set:
Refund  Marital Status  Taxable Income  Cheat
No      Single          75K             ?
Yes     Married         50K             ?
No      Married         150K            ?
Yes     Divorced        90K             ?
No      Single          40K             ?
No      Married         80K             ?

[Figure: Training Set -> Learn Classifier -> Model; the Model is applied to the Test Set]
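One way to make the train-then-apply flow concrete is to hand-code the tree from the Example Decision Tree slide as the model and run it on both sets. This is a sketch: the function name classify is illustrative, and since the slide only labels the income branches "< 80K" and "> 80K", treating exactly 80K as NO is an assumption.

```python
def classify(refund, marital_status, income_k):
    """Predict the Cheat label using the tree from the example slide."""
    if refund == "Yes":
        return "No"
    if marital_status == "Married":
        return "No"
    # Single or Divorced: split on taxable income (80K itself assumed to be "No").
    return "Yes" if income_k > 80 else "No"


training_set = [
    ("Yes", "Single", 125, "No"), ("No", "Married", 100, "No"),
    ("No", "Single", 70, "No"), ("Yes", "Married", 120, "No"),
    ("No", "Divorced", 95, "Yes"), ("No", "Married", 60, "No"),
    ("Yes", "Divorced", 220, "No"), ("No", "Single", 85, "Yes"),
    ("No", "Married", 75, "No"), ("No", "Single", 90, "Yes"),
]
test_set = [
    ("No", "Single", 75), ("Yes", "Married", 50), ("No", "Married", 150),
    ("Yes", "Divorced", 90), ("No", "Single", 40), ("No", "Married", 80),
]

# Fraction of training records the tree labels correctly.
correct = sum(classify(r, m, i) == label for r, m, i, label in training_set)
training_accuracy = correct / len(training_set)

# Fill in the "?" column of the test set.
test_predictions = [classify(r, m, i) for r, m, i in test_set]
print(training_accuracy, test_predictions)
```

Accuracy on the training set says little about unseen data; in practice the model would be assessed on test records whose true labels are known.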
Classification Application
- Direct Marketing
- Fraud Detection
- Customer Attrition/Churn
- Sky Survey Cataloging
Data Mining Tasks: Clustering
- Goal is to identify categories
- Natural grouping of customers by processing all the available data about them.
- Other applications
  - market segmentation, discovering affinity groups, and defect analysis
Kohonen Network
Description
- unsupervised
- seeks to describe dataset in terms of natural clusters of cases
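A very small sketch of a one-dimensional Kohonen (self-organizing) map, assuming six made-up 2-D cases, two map units, and simple linear decay schedules for the learning rate and neighborhood width. All of these choices, and the names nearest_unit and train_som, are illustrative rather than taken from the lecture.

```python
import math

# Hypothetical cases forming two natural clusters.
cases = [(0.0, 0.1), (0.1, 0.0), (0.2, 0.1),
         (5.0, 5.1), (5.1, 4.9), (4.9, 5.0)]


def nearest_unit(weights, case):
    """Index of the map unit whose weight vector is closest to the case."""
    return min(range(len(weights)), key=lambda i: math.dist(weights[i], case))


def train_som(data, n_units=2, epochs=50):
    """Train a 1-D map; no class labels are used (unsupervised)."""
    # Deterministic start: spread the units between the first and last case.
    first, last = data[0], data[-1]
    weights = [
        [f + (l - f) * i / (n_units - 1) for f, l in zip(first, last)]
        for i in range(n_units)
    ]
    for epoch in range(epochs):
        decay = 1.0 - epoch / epochs        # shrinks from 1 toward 0
        rate = 0.5 * decay                  # learning rate
        radius = max(decay, 0.1)            # neighborhood width on the map
        for case in data:
            winner = nearest_unit(weights, case)
            for i, w in enumerate(weights):
                # Units close to the winner on the map are pulled along with it.
                influence = math.exp(-((i - winner) ** 2) / (2 * radius ** 2))
                for d in range(len(w)):
                    w[d] += rate * influence * (case[d] - w[d])
    return weights


weights = train_som(cases)
assignments = [nearest_unit(weights, c) for c in cases]
print(assignments)
```

After training, each unit's weight vector sits near one group of cases, so the unit index serves as a cluster label.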
Data Mining Tasks: Association Rule Discovery
- Given a set of records, each of which contains some number of items from a given collection;
- Produce dependency rules which will predict occurrence of an item based on occurrences of other items.

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

Rules Discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}
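How strongly the data backs each rule is commonly measured with support and confidence; the slides do not name these measures, so using them here is an assumption. The sketch below computes both for the two rules on the five transactions above, with illustrative function names.

```python
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]


def count(itemset):
    """Number of transactions that contain every item in the itemset."""
    return sum(itemset <= t for t in transactions)


def support(antecedent, consequent):
    """Fraction of all transactions containing both sides of the rule."""
    return count(antecedent | consequent) / len(transactions)


def confidence(antecedent, consequent):
    """Among transactions containing the antecedent, the fraction that also contain the consequent."""
    return count(antecedent | consequent) / count(antecedent)


rules = [({"Milk"}, {"Coke"}), ({"Diaper", "Milk"}, {"Beer"})]
for lhs, rhs in rules:
    print(sorted(lhs), "-->", sorted(rhs),
          "support:", support(lhs, rhs),
          "confidence:", round(confidence(lhs, rhs), 3))
```

Rule mining algorithms typically keep only rules whose support and confidence exceed user-chosen minimums.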
Association Rule Discovery Application
- Marketing and Sales Promotion
- Supermarket Shelf Management
- Inventory Management
Deviation Detection & Pattern Discovery
Deviation Detection:
…discovering most significant changes in data from
previously measured or normative values…
V. Kumar, M. Joshi, Tutorial on High Performance Data Mining.
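One simple reading of this definition is to flag new measurements that lie far from a baseline of previously measured values. The sketch below uses a z-score with a threshold of 3; the data, the threshold, and the function name deviations are all illustrative assumptions, and real deviation detection methods vary widely.

```python
import statistics

# Hypothetical previously measured (normative) values and new observations.
baseline = [100, 102, 98, 101, 99]
new_values = [101, 97, 110, 100]


def deviations(reference, values, threshold=3.0):
    """Return the values lying more than `threshold` standard deviations from the reference mean."""
    mean = statistics.mean(reference)
    spread = statistics.stdev(reference)
    return [v for v in values if abs(v - mean) / spread > threshold]


flagged = deviations(baseline, new_values)
print(flagged)
```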
Sequential Pattern Discovery:
…process of looking for patterns and rules that predict
strong sequential dependencies among different
events…
V. Kumar, M. Joshi, Tutorial on High Performance Data Mining.
Sequential Patterns
- Identify frequently occurring sequences from given records
  - 40 percent of female customers buy a gray skirt six months after buying a red jacket
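A statistic like the one in this example can be computed by checking, for each customer who bought the first item, whether the second item follows within a time window. The purchase histories, the 200-day window, and the function names below are made up for illustration.

```python
from datetime import date

# Hypothetical purchase histories: customer -> list of (date, item).
purchases = {
    "c1": [(date(2023, 1, 10), "red jacket"), (date(2023, 7, 5), "gray skirt")],
    "c2": [(date(2023, 2, 1), "red jacket")],
    "c3": [(date(2023, 3, 3), "gray skirt"), (date(2023, 4, 1), "red jacket")],
    "c4": [(date(2023, 1, 20), "red jacket"), (date(2023, 6, 30), "gray skirt")],
    "c5": [(date(2023, 5, 5), "blue scarf")],
}


def follows(history, first, then, max_days):
    """True if `then` was bought after `first`, at most max_days later."""
    for d1, item1 in history:
        if item1 != first:
            continue
        for d2, item2 in history:
            if item2 == then and 0 < (d2 - d1).days <= max_days:
                return True
    return False


def sequence_confidence(histories, first, then, max_days=200):
    """Among customers who bought `first`, the fraction who later bought `then`."""
    buyers = [h for h in histories.values() if any(i == first for _, i in h)]
    if not buyers:
        return 0.0
    return sum(follows(h, first, then, max_days) for h in buyers) / len(buyers)


share = sequence_confidence(purchases, "red jacket", "gray skirt")
print(share)
```

Order matters here: customer c3 bought both items but in the reverse order, so that history does not count toward the pattern.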
Data Mining Methodology: SAS
- Sample
  - Extract a portion of the dataset for data mining
- Explore
- Modify
  - Create, select and transform variables with the intention of building a model
- Model
  - Specify a relationship of variables that reliably predicts a desired goal
- Assess
  - Evaluate the practical value of the findings and the model resulting from the data mining effort
Data Mining Methodology: CRISP-DM
- Data understanding
- Data preparation
- Modeling
- Evaluation
- Deployment
CRISP-DM Phases
Phases and Tasks
Each task is followed by its outputs.
Business Understanding
- Determine Business Objectives: Background; Business Objectives; Business Success Criteria
- Situation Assessment: Inventory of Resources; Requirements, Assumptions, and Constraints; Risks and Contingencies; Terminology; Costs and Benefits
- Determine Data Mining Goal: Data Mining Goals; Data Mining Success Criteria
- Produce Project Plan: Project Plan; Initial Assessment of Tools and Techniques
Data Understanding
- Collect Initial Data: Initial Data Collection Report
- Describe Data: Data Description Report
- Explore Data: Data Exploration Report
- Verify Data Quality: Data Quality Report
Data Preparation
- Outputs of the phase: Data Set; Data Set Description
- Select Data: Rationale for Inclusion / Exclusion
- Clean Data: Data Cleaning Report
- Construct Data: Derived Attributes; Generated Records
- Integrate Data: Merged Data
- Format Data: Reformatted Data
Modeling
- Select Modeling Technique: Modeling Technique; Modeling Assumptions
- Generate Test Design: Test Design
- Build Model: Parameter Settings; Models; Model Description
- Assess Model: Model Assessment; Revised Parameter Settings
Evaluation
- Evaluate Results: Assessment of Data Mining Results w.r.t. Business Success Criteria; Approved Models
- Review Process: Review of Process
- Determine Next Steps: List of Possible Actions; Decision
Deployment
- Plan Deployment: Deployment Plan
- Plan Monitoring and Maintenance: Monitoring and Maintenance Plan
- Produce Final Report: Final Report; Final Presentation
- Review Project: Experience Documentation
Major Application Areas for Data Mining Solutions
- Fraud/Non-Compliance Anomaly detection
  - Isolate the factors that lead to fraud, waste and abuse
  - Target auditing and investigative efforts more effectively
- Credit/Risk Scoring
- Intrusion detection
- Parts failure prediction
- Recruiting/Attracting customers
- Maximizing profitability (cross selling, identifying profitable customers)
- Service Delivery and Customer Retention
  - Build profiles of customers likely to use which services
- Web Mining
- Health Care
Controversial Issues
- Data mining (or simple analysis) on people may come up with a profile that would raise controversial issues of
  - Discrimination
  - Privacy
  - Security
- Examples:
  - Should males between 18 and 35 from countries that produced terrorists be singled out for search before flight?
  - Can people be denied a mortgage based on age, sex, race?
  - Women live longer. Should they pay less for life insurance?
Data Mining and Discrimination
- Can discrimination be based on features like sex, age, national origin?
- In some areas (e.g. mortgages, employment), some features cannot be used for decision making
- In other areas, these features are needed to assess the risk factors
  - E.g. people of African descent are more susceptible to sickle cell anemia
Data Mining and Privacy
- Can information collected for one purpose be used for mining data for another purpose?
  - In Europe, generally no, without explicit consent
  - In US, generally yes
- Companies routinely collect information about customers and use it for marketing, etc.
- People may be willing to give up some of their privacy in exchange for some benefits
- See Data Mining And Privacy Symposium, www.kdnuggets.com/gpspubs/ieee-expert-9504-priv.html
Data Mining and Privacy
- Data Mining looks for patterns, not people!
- Technical solutions can limit privacy invasion
  - Replacing sensitive personal data with an anonymous ID
  - Giving randomized outputs
  - Multi-party computation on distributed data
…
The Hype Curve for Data Mining and Knowledge Discovery
[Figure: Expectations and Performance plotted over time (1990, 1998, 2000, 2002), with stages labeled rising expectations, over-inflated expectations, disappointment, and growing acceptance and mainstreaming]
Final Remarks
- Data Mining can be utilized in any field that needs to find patterns or relationships in its data.