Roiger_DM_ch01 - Gonzaga University

Part I
Data Mining Fundamentals
Chapter 1
Data Mining: A First View
Jason C. H. Chen, Ph.D.
Professor of MIS
School of Business Administration
Gonzaga University
Spokane, WA 99223
A/W & Dr. Chen, Data Mining
1.1 Data Mining: A Definition
• The process of employing one or more
computer learning techniques to
automatically analyze and extract
knowledge from data.
Induction-based Learning
• The process of forming general concept
definitions by observing specific examples
of concepts to be learned.
Knowledge Discovery in Databases
(KDD)
• The application of the scientific method to
data mining. Data mining is one step of the
KDD process.
Data Mining Examples
• A telephone company used a data mining tool to
analyze its customer data warehouse. The data
mining tool found about 10,000 supposedly
residential customers who were spending over
$1,000 monthly on phone bills.
• After further study, the phone company
discovered that they were really small business
owners trying to avoid paying business rates.
Other Data Mining Examples
• 65% of customers who did not use the credit card
in the last six months are 88% likely to cancel
their accounts.
• If age < 30 and income <= $25,000 and credit
rating < 3 and credit amount > $25,000 then the
minimum loan term is 10 years.
• 82% of customers who bought a new TV 27" or
larger are 90% likely to buy an entertainment
center within the next 4 weeks.
1.2 What Can Computers Learn?
Four Levels of Learning
• Fact
– a simple statement of truth
• Concept
– a set of objects, symbols, or events grouped together because
they share certain characteristics
• Procedure
– a step-by-step course of action to achieve a goal. We use
procedures in our everyday functioning as well as in the
solution of difficult problems.
• Principle
– the highest level of learning. Principles are general
truths or laws that are basic to other truths.
Source: Merrill and Tennyson, 1977 (p. 5 of the text)
Concepts
• Computers are good at learning concepts.
Concepts are the output of a data mining
session.
Three Concept Views
• Classical View
– Attests that all concepts have definite defining
properties.
• Probabilistic View
– Concepts are represented by properties that are
probable of concept members.
• Exemplar View
– States that a given instance is determined to be an
example of a particular concept if the instance is similar
enough to a set of one or more known examples of the
concept.
Figure - A hierarchy of data mining strategies
Data Mining Strategies
  Unsupervised Clustering (no output attributes)
  Supervised Learning
    Classification: categorical/discrete output (current behavior)
    Estimation: numeric output
    Prediction: future outcome (categorical/numeric)
  Market Basket Analysis
Supervised Learning
Supervised learning is the process of building
classification models using data instances of known
origin.
Two purposes:
• 1. Build a learner (classification) model using data
instances of known origin.
– This is an induction process.
• 2. Use the model to determine the outcome of new
instances of unknown origin.
– This is a deduction process.
Supervised Learning:
A Decision Tree Example
Decision Tree
• A tree structure where non-terminal nodes
represent tests on one or more attributes and
terminal nodes reflect decision outcomes.
Table 1.1 – Hypothetical Training Data for Disease Diagnosis
Patient ID#  Sore Throat  Fever  Swollen Glands  Congestion  Headache  Diagnosis
1            Yes          Yes    Yes             Yes         Yes       Strep throat
2            No           No     No              Yes         Yes       Allergy
3            Yes          Yes    No              Yes         No        Cold
4            Yes          No     Yes             No          No        Strep throat
5            No           Yes    No              Yes         No        Cold
6            No           No     No              Yes         No        Allergy
7            No           No     Yes             No          No        Strep throat
8            Yes          No     No              Yes         Yes       Allergy
9            No           Yes    No              Yes         Yes       Cold
10           Yes          Yes    No              Yes         Yes       Cold
Figure 1.1 – A decision tree for the data in Table 1.1
Swollen Glands
  Yes: Diagnosis = Strep Throat
  No: Fever
    Yes: Diagnosis = Cold
    No: Diagnosis = Allergy
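The slides do not say which algorithm produced the tree in Figure 1.1. As a rough illustration of the induction step, the sketch below assumes an ID3-style learner that repeatedly splits on the attribute with the highest information gain. The function names (entropy, split, information_gain, induce_tree, classify) are illustrative and not taken from the text; the data is Table 1.1.

```python
import math
from collections import Counter

ATTRIBUTES = ["Sore Throat", "Fever", "Swollen Glands", "Congestion", "Headache"]

# Table 1.1: attribute values in ATTRIBUTES order, followed by the diagnosis.
ROWS = [
    ("Yes", "Yes", "Yes", "Yes", "Yes", "Strep throat"),
    ("No", "No", "No", "Yes", "Yes", "Allergy"),
    ("Yes", "Yes", "No", "Yes", "No", "Cold"),
    ("Yes", "No", "Yes", "No", "No", "Strep throat"),
    ("No", "Yes", "No", "Yes", "No", "Cold"),
    ("No", "No", "No", "Yes", "No", "Allergy"),
    ("No", "No", "Yes", "No", "No", "Strep throat"),
    ("Yes", "No", "No", "Yes", "Yes", "Allergy"),
    ("No", "Yes", "No", "Yes", "Yes", "Cold"),
    ("Yes", "Yes", "No", "Yes", "Yes", "Cold"),
]
TRAINING = [dict(zip(ATTRIBUTES + ["Diagnosis"], row)) for row in ROWS]


def entropy(instances):
    """Shannon entropy (in bits) of the Diagnosis values in a set of instances."""
    counts = Counter(inst["Diagnosis"] for inst in instances)
    total = len(instances)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def split(instances, attribute):
    """Group instances by their value for the given attribute."""
    groups = {}
    for inst in instances:
        groups.setdefault(inst[attribute], []).append(inst)
    return groups


def information_gain(instances, attribute):
    """Reduction in entropy obtained by splitting on the attribute."""
    groups = split(instances, attribute).values()
    remainder = sum(len(g) / len(instances) * entropy(g) for g in groups)
    return entropy(instances) - remainder


def induce_tree(instances, attributes):
    """Induction: build a nested-dict tree; leaves are diagnosis strings."""
    classes = Counter(inst["Diagnosis"] for inst in instances)
    if len(classes) == 1 or not attributes:
        return classes.most_common(1)[0][0]
    best = max(attributes, key=lambda a: information_gain(instances, a))
    remaining = [a for a in attributes if a != best]
    return {best: {value: induce_tree(group, remaining)
                   for value, group in split(instances, best).items()}}


def classify(tree, instance):
    """Deduction: follow the tests from the root down to a leaf."""
    while isinstance(tree, dict):
        attribute = next(iter(tree))
        tree = tree[attribute][instance[attribute]]
    return tree


tree = induce_tree(TRAINING, ATTRIBUTES)
print(tree)
```

On this data the sketch selects Swollen Glands at the root and Fever below its No branch, which matches Figure 1.1. It makes no attempt to handle attribute values that never occur in the training data.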
Table 1.2 – Data Instances with an Unknown Classification
Patient ID#  Sore Throat  Fever  Swollen Glands  Congestion  Headache  Diagnosis
11           No           No     Yes             Yes         Yes       ?
12           Yes          Yes    No              No          Yes       ?
13           No           No     No              No          Yes       ?
Production Rules
We can translate any decision tree into a set of production
rules. They are rules of the form:
IF <antecedent conditions>
THEN <consequent conditions>
• IF Swollen Glands = Yes
THEN Diagnosis = Strep Throat
• IF Swollen Glands = No & Fever = Yes
THEN Diagnosis = Cold
• IF Swollen Glands = No & Fever = No
THEN Diagnosis = Allergy
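As a small sketch of how these rules could be applied, the code below stores each rule as a pair of antecedent conditions and a consequent, then runs the three unclassified patients from Table 1.2 through them. The names RULES, UNKNOWN, and diagnose are illustrative assumptions, not part of the text.

```python
# Each production rule: (antecedent conditions, consequent diagnosis).
RULES = [
    ({"Swollen Glands": "Yes"}, "Strep Throat"),
    ({"Swollen Glands": "No", "Fever": "Yes"}, "Cold"),
    ({"Swollen Glands": "No", "Fever": "No"}, "Allergy"),
]

# Table 1.2: instances whose diagnosis is unknown, keyed by patient ID.
UNKNOWN = {
    11: {"Sore Throat": "No", "Fever": "No", "Swollen Glands": "Yes",
         "Congestion": "Yes", "Headache": "Yes"},
    12: {"Sore Throat": "Yes", "Fever": "Yes", "Swollen Glands": "No",
         "Congestion": "No", "Headache": "Yes"},
    13: {"Sore Throat": "No", "Fever": "No", "Swollen Glands": "No",
         "Congestion": "No", "Headache": "Yes"},
}


def diagnose(instance, rules=RULES):
    """Return the consequent of the first rule whose antecedent fully matches."""
    for antecedent, consequent in rules:
        if all(instance.get(attr) == value for attr, value in antecedent.items()):
            return consequent
    return None  # no rule fired


results = {patient: diagnose(instance) for patient, instance in UNKNOWN.items()}
for patient, diagnosis in results.items():
    print(patient, diagnosis)
```

Under these rules patient 11 is diagnosed with Strep Throat, patient 12 with Cold, and patient 13 with Allergy. Sore Throat, Congestion, and Headache never appear in an antecedent, so they have no effect on the outcome.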
Unsupervised Clustering
• A data mining method that builds models from
data without predefined classes (see Table 1.3).
• Data instances are grouped together based on a
similarity scheme defined by the clustering
system.
• With the help of one or several evaluation
techniques, it is up to us to decide the meaning of
the formed clusters.
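The slides leave the similarity scheme up to the clustering system. As one hypothetical choice, the sketch below scores two customers from Table 1.3 by the fraction of categorical attributes on which they agree. The numeric Trades/Month column is left out to keep the measure simple, and the names CUSTOMERS, similarity, and scores are illustrative.

```python
from itertools import combinations

# Table 1.3, categorical attributes only:
# (account type, margin account, transaction method, sex, age, recreation, income)
CUSTOMERS = {
    1005: ("Joint", "No", "Online", "F", "30-39", "Tennis", "40-59K"),
    1013: ("Custodial", "No", "Broker", "F", "50-59", "Skiing", "80-99K"),
    1245: ("Joint", "No", "Online", "M", "20-29", "Golf", "20-39K"),
    2110: ("Individual", "Yes", "Broker", "M", "30-39", "Fishing", "40-59K"),
    1001: ("Individual", "Yes", "Online", "M", "40-49", "Golf", "60-79K"),
}


def similarity(a, b):
    """Fraction of attributes on which two customers have the same value."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


# Pairwise similarity for every pair of customers.
scores = {
    (i, j): similarity(CUSTOMERS[i], CUSTOMERS[j])
    for i, j in combinations(CUSTOMERS, 2)
}

for pair, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(pair, round(score, 2))
```

A clustering system would group instances with high mutual scores. With only five customers several pairs tie for the highest score, which illustrates why the meaning of the resulting clusters still has to be judged by a person.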
Table 1.3 – Acme Investors Incorporated
Customer  Account     Margin   Transaction  Trades/            Favorite    Annual
ID        Type        Account  Method       Month    Sex  Age    Recreation  Income
1005      Joint       No       Online       12.5     F    30–39  Tennis      40–59K
1013      Custodial   No       Broker       0.5      F    50–59  Skiing      80–99K
1245      Joint       No       Online       3.6      M    20–29  Golf        20–39K
2110      Individual  Yes      Broker       22.3     M    30–39  Fishing     40–59K
1001      Individual  Yes      Online       5        M    40–49  Golf        60–79K
Possible Questions
Questions for supervised learning
1. Can I develop a general profile of an online investor? If so, what
characteristics distinguish online investors from investors that use
a broker?
2. Can I determine if a new customer who does not initially open a
margin account is likely to do so in the future?
3. Can I build a model able to accurately predict the average number
of trades per month for a new investor?
4. What characteristics differentiate female and male investors?
Questions for unsupervised learning
1. What attribute similarities group customers of Acme Investors together?
2. What differences in attribute values segment the customer database?
1.3 Is Data Mining Appropriate
for My Problem?
Data Mining or Data Query?
• Shallow Knowledge
– Is factual; tools used: DBMS/SQL
• Multidimensional Knowledge
– Is factual; tools used: OLAP
• Hidden Knowledge
– Represents patterns or regularities in data that cannot be
easily found; tools used: data mining
• Deep Knowledge
– Knowledge stored in a database that can only be found
if we are given some direction.
Data Mining vs. Data Query: An
Example
• Use a data query if you already know, at least
roughly, what you are looking for.
• Use data mining to find regularities in data
that are not obvious.
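To make the query side of this contrast concrete, the sketch below loads a few columns of Table 1.3 into an in-memory SQLite database and asks a question whose shape is known in advance: which online customers trade more than four times a month? The table name, column names, and the threshold of four are hypothetical choices made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers ("
    "customer_id INTEGER PRIMARY KEY, "
    "transaction_method TEXT, "
    "trades_per_month REAL)"
)
# Selected columns of Table 1.3.
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [
        (1005, "Online", 12.5),
        (1013, "Broker", 0.5),
        (1245, "Online", 3.6),
        (2110, "Broker", 22.3),
        (1001, "Online", 5.0),
    ],
)

# A data query: the attributes and the condition are specified up front.
rows = conn.execute(
    "SELECT customer_id FROM customers "
    "WHERE transaction_method = ? AND trades_per_month > ? "
    "ORDER BY customer_id",
    ("Online", 4),
).fetchall()
online_active = [row[0] for row in rows]
print(online_active)
conn.close()
```

The query can only return what it was told to look for. Deciding which attributes separate online investors from broker-assisted ones in the first place is the kind of question left to a data mining tool.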
1.4 Expert Systems or Data
Mining?
Expert System and Knowledge
Engineer
• An expert system is a computer program
that emulates the problem-solving skills of
one or more human experts.
• A knowledge engineer is a person trained to
interact with an expert in order to capture
their knowledge.
Data -> Data Mining Tool -> If Swollen Glands = Yes Then Diagnosis = Strep Throat
Human Expert -> Knowledge Engineer -> Expert System Building Tool -> If Swollen Glands = Yes Then Diagnosis = Strep Throat
1.5 A Simple Data Mining Process
Model
Figure 1.3 - A simple data mining process model
(components shown: Operational Database, SQL Queries, Data Warehouse, Data Mining, Interpretation & Evaluation, Result Application)
Characteristics of a Data Warehouse
• Data Warehouse:
– Definition: a subject-oriented, integrated, time-variant,
non-updatable collection of data used in
support of management decision-making processes
– Subject-oriented: e.g. customers, patients, students,
products
– Integrated: Consistent naming conventions, formats,
encoding structures; from multiple data sources
– Time-variant: Can study trends and changes
– Non-updatable: Read-only, periodically refreshed
• Data Mart:
– A data warehouse that is limited in scope
A four-step process for performing
a data mining session
• 1. Assembling the data
– Operational database (relational databases and flat
files) vs. data warehouse
• 2. Mining the data (giving the data to a mining
tool)
– Instances for building the model or testing the model
• 3. Interpreting the results
• 4. Result application
1.7 Data Mining Applications (p.24)
• Fraud detection
• Health care
• Business and finance
• Scientific applications
• Sports and gaming
Customer Intrinsic Value
[Figure: plot of Intrinsic (Predicted) Value against Actual Value; data points are marked X, with labels A, B, and C]