
Chapter 3
Basic Data Mining Techniques
Jason C. H. Chen, Ph.D.
Professor of MIS
School of Business Administration
Gonzaga University
Spokane, WA 99223
[email protected]
Objectives
• The chapter introduces several common data mining techniques.
• Section 3.1 focuses on supervised learning by presenting a standard algorithm for creating decision trees.
• Section 3.2 presents an efficient technique for generating association rules.
• Section 3.3 illustrates unsupervised clustering and the K-Means algorithm.
• Section 3.4 shows how genetic algorithms can perform supervised learning and unsupervised clustering. (to be skipped)
3.1 Decision Trees
• Decision trees are constructed using only those attributes best able to differentiate the concepts to be learned.
• A decision tree is built by initially selecting a subset of instances from a training set.
• The subset is then used by the algorithm to construct a decision tree. The remaining training set instances test the accuracy of the constructed tree.
  – If the decision tree classifies the instances correctly, the procedure terminates.
  – If an instance is incorrectly classified, the instance is added to the selected subset of training instances and a new tree is constructed.
An Algorithm for Building Decision Trees

1. Let T be the set of training instances.
2. Choose an attribute that best differentiates the instances in T.
3. Create a tree node whose value is the chosen attribute.
   - Create child links from this node where each link represents a unique value for the chosen attribute.
   - Use the child link values to further subdivide the instances into subclasses.
4. For each subclass created in step 3:
   a. If the instances in the subclass satisfy predefined criteria, or if the set of remaining attribute choices for this path is null, specify the classification for new instances following this decision path.
   b. If the subclass does not satisfy the criteria and there is at least one attribute to further subdivide the path of the tree, let T be the current set of subclass instances and return to step 2.
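To make these steps concrete, here is a minimal Python sketch of the procedure. It is only an illustration, not the textbook's implementation: the "best" attribute in step 2 is chosen with a simple purity count (how many instances each branch's most common class would get right) rather than a more principled measure such as information gain, and the three-instance dataset at the end is a hypothetical call.

from collections import Counter

def best_attribute(instances, attributes, target):
    # Stand-in for step 2: score an attribute by how many instances its branches
    # classify correctly when each branch predicts its most common class.
    def purity(attr):
        score = 0
        for value in set(row[attr] for row in instances):
            branch = [row[target] for row in instances if row[attr] == value]
            score += Counter(branch).most_common(1)[0][1]
        return score
    return max(attributes, key=purity)

def build_tree(instances, attributes, target):
    classes = [row[target] for row in instances]
    # Step 4a: stop when the subclass is pure or no attributes remain.
    if len(set(classes)) == 1 or not attributes:
        return Counter(classes).most_common(1)[0][0]
    attr = best_attribute(instances, attributes, target)           # step 2
    node = {attr: {}}                                              # step 3
    remaining = [a for a in attributes if a != attr]
    for value in set(row[attr] for row in instances):              # child links
        subset = [row for row in instances if row[attr] == value]
        node[attr][value] = build_tree(subset, remaining, target)  # step 4b
    return node

# Hypothetical call with three instances and two input attributes:
data = [
    {"Sex": "Male",   "Credit Card Insurance": "No",  "Life Insurance Promotion": "No"},
    {"Sex": "Female", "Credit Card Insurance": "No",  "Life Insurance Promotion": "Yes"},
    {"Sex": "Male",   "Credit Card Insurance": "Yes", "Life Insurance Promotion": "Yes"},
]
print(build_tree(data, ["Sex", "Credit Card Insurance"], "Life Insurance Promotion"))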
Table 3.1 • The Credit Card Promotion Database (same as Table 2.3)

Income      Magazine    Watch       Life Insurance   Credit Card
Range ($)   Promotion   Promotion   Promotion        Insurance     Sex      Age
40–50K      Yes         No          No               No            Male     45
30–40K      Yes         Yes         Yes              No            Female   40
40–50K      No          No          No               No            Male     42
30–40K      Yes         Yes         Yes              Yes           Male     43
50–60K      Yes         No          Yes              No            Female   38
20–30K      No          No          No               No            Female   55
30–40K      Yes         No          Yes              Yes           Male     35
20–30K      No          Yes         No               No            Male     27
30–40K      Yes         No          No               No            Male     43
30–40K      Yes         Yes         Yes              No            Female   41
40–50K      No          Yes         Yes              No            Female   43
20–30K      No          Yes         Yes              No            Male     29
50–60K      Yes         Yes         Yes              No            Female   39
40–50K      No          Yes         No               No            Male     55
20–30K      No          No          Yes              Yes           Female   19
Figure 3.1 A partial decision tree with root node = Income Range
(Life Insurance Promotion counts by branch: 20–30K → 2 Yes, 2 No; 30–40K → 4 Yes, 1 No; 40–50K → 1 Yes, 3 No; 50–60K → 2 Yes, 0 No)
Figure 3.2 A partial decision tree with root node = Credit Card Insurance
(Life Insurance Promotion counts by branch: Yes → 3 Yes, 0 No; No → 6 Yes, 6 No)
Figure 3.3 A partial decision tree with root node = Age
(Life Insurance Promotion counts by branch: Age <= 43 → 9 Yes, 3 No; Age > 43 → 0 Yes, 3 No)
Decision Trees for the Credit
Card Promotion Database
Figure 3.4 A three-node decision tree for the credit card database
(Age > 43 → No (3/0); Age <= 43 and Sex = Female → Yes (6/0); Age <= 43, Sex = Male, and Credit Card Insurance = No → No (4/1); Age <= 43, Sex = Male, and Credit Card Insurance = Yes → Yes (2/0))
Figure 3.5 A two-node decision tree for the credit card database
(Credit Card Insurance = Yes → Yes (3/0); Credit Card Insurance = No and Sex = Female → Yes (6/1); Credit Card Insurance = No and Sex = Male → No (6/1))
Table 3.2 • Training Data Instances Following the Path in Figure 3.4 to Credit Card Insurance = No

Income      Life Insurance   Credit Card
Range       Promotion        Insurance     Sex      Age
40–50K      No               No            Male     42
20–30K      No               No            Male     27
30–40K      No               No            Male     43
20–30K      Yes              No            Male     29
Decision Tree Rules
A Rule for the Tree in Figure
3.4
IF Age <=43 & Sex = Male
& Credit Card Insurance = No
THEN Life Insurance Promotion = No
A Simplified Rule Obtained by
Removing Attribute Age
IF Sex = Male & Credit Card Insurance = No
THEN Life Insurance Promotion = No
Other Methods for Building
Decision Trees
• CART
• CHAID
Advantages of Decision Trees
• Easy to understand.
• Map nicely to a set of production rules.
• Have been successfully applied to real problems.
• Make no prior assumptions about the data.
• Able to process both numerical and categorical data.
Disadvantages of Decision
Trees
• Output attribute must be categorical.
• Limited to one output attribute.
• Decision tree algorithms are unstable.
• Trees created from numeric datasets can be complex.
3.2 Generating Association
Rules
Confidence and Support
Rule Confidence
Given a rule of the form “If A then
B”, rule confidence is the conditional
probability that B is true when A is
known to be true.
Rule Support
The minimum percentage of instances
in the database that contain all items
listed in a given association rule.
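For example, if a rule "If A then B" holds for 5 of the 10 instances in a database and A alone is true in 7 of them, the rule's confidence is 5/7 ≈ 71.4% and its support is 5/10 = 50%. (These happen to be the numbers for the rule IF Magazine Promotion = Yes THEN Life Insurance Promotion = Yes in the example that follows.)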
Mining Association Rules:
An Example
Table 3.3 • A Subset of the Credit Card Promotion Database

Magazine    Watch       Life Insurance   Credit Card
Promotion   Promotion   Promotion        Insurance     Sex
Yes         No          No               No            Male
Yes         Yes         Yes              No            Female
No          No          No               No            Male
Yes         Yes         Yes              Yes           Male
Yes         No          Yes              No            Female
No          No          No               No            Female
Yes         No          Yes              Yes           Male
No          Yes         No               No            Male
Yes         No          No               No            Male
Yes         Yes         Yes              No            Female
Table 3.4 • Single-Item Sets

Single-Item Sets                    Number of Items
Magazine Promotion = Yes            7
Watch Promotion = Yes               4
Watch Promotion = No                6
Life Insurance Promotion = Yes      5
Life Insurance Promotion = No       5
Credit Card Insurance = No          8
Sex = Male                          6
Sex = Female                        4
Table 3.5 • Two-Item Sets

Two-Item Sets                                                Number of Items
Magazine Promotion = Yes & Watch Promotion = No              4
Magazine Promotion = Yes & Life Insurance Promotion = Yes    5
Magazine Promotion = Yes & Credit Card Insurance = No        5
Magazine Promotion = Yes & Sex = Male                        4
Watch Promotion = No & Life Insurance Promotion = No         4
Watch Promotion = No & Credit Card Insurance = No            5
Watch Promotion = No & Sex = Male                            4
Life Insurance Promotion = No & Credit Card Insurance = No   5
Life Insurance Promotion = No & Sex = Male                   4
Credit Card Insurance = No & Sex = Male                      4
Credit Card Insurance = No & Sex = Female                    4
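The counts in Tables 3.4 and 3.5 can be reproduced with a short apriori-style sketch in Python. This is only an illustration, not the textbook's algorithm: the names table_3_3 and MIN_COVERAGE are introduced here, and the coverage threshold of 4 is an assumption chosen because it yields exactly the item sets listed above.

from itertools import combinations
from collections import Counter

# The ten instances of Table 3.3.
COLUMNS = ["Magazine Promotion", "Watch Promotion", "Life Insurance Promotion",
           "Credit Card Insurance", "Sex"]
ROWS = [
    ("Yes", "No",  "No",  "No",  "Male"),
    ("Yes", "Yes", "Yes", "No",  "Female"),
    ("No",  "No",  "No",  "No",  "Male"),
    ("Yes", "Yes", "Yes", "Yes", "Male"),
    ("Yes", "No",  "Yes", "No",  "Female"),
    ("No",  "No",  "No",  "No",  "Female"),
    ("Yes", "No",  "Yes", "Yes", "Male"),
    ("No",  "Yes", "No",  "No",  "Male"),
    ("Yes", "No",  "No",  "No",  "Male"),
    ("Yes", "Yes", "Yes", "No",  "Female"),
]
table_3_3 = [dict(zip(COLUMNS, row)) for row in ROWS]

MIN_COVERAGE = 4   # assumed minimum item set coverage

# Single-item sets (Table 3.4): count every attribute = value pair, keep the frequent ones.
single_counts = Counter((attr, val) for row in table_3_3 for attr, val in row.items())
singles = {item: n for item, n in single_counts.items() if n >= MIN_COVERAGE}

# Two-item sets (Table 3.5): pair frequent single items from different attributes.
pairs = {}
for a, b in combinations(singles, 2):
    if a[0] == b[0]:
        continue   # two values of the same attribute can never co-occur
    n = sum(1 for row in table_3_3 if row[a[0]] == a[1] and row[b[0]] == b[1])
    if n >= MIN_COVERAGE:
        pairs[frozenset((a, b))] = n

# Confidence and support for: IF Magazine Promotion = Yes THEN Life Insurance Promotion = Yes
covered = pairs[frozenset({("Magazine Promotion", "Yes"), ("Life Insurance Promotion", "Yes")})]
antecedent = singles[("Magazine Promotion", "Yes")]
print("confidence:", covered / antecedent)     # 5/7, about 0.71
print("support:", covered / len(table_3_3))    # 5/10 = 0.50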
General Considerations
• We are interested in association rules that show a lift in product sales, where the lift is the result of the product's association with one or more other products.
• We are also interested in association rules that show a lower-than-expected confidence for a particular association.
Up here for now!
3.3 The K-Means Algorithm
1. Choose a value for K, the total number of clusters.
2. Randomly choose K points as cluster centers.
3. Assign the remaining instances to their closest
cluster center.
4. Calculate a new cluster center for each cluster.
5. Repeat steps 3-5 until the cluster centers do not
change.
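A minimal sketch of these steps in plain Python is shown below; it is not the textbook's code. To keep the run reproducible it takes the first K points as the initial cluster centers instead of choosing them at random, and it uses Euclidean distance for step 3.

import math

def k_means(points, k, max_iters=100):
    # Step 2: choose K points as the initial cluster centers
    # (the first k points here, rather than a random choice).
    centers = list(points[:k])
    clusters = []
    for _ in range(max_iters):
        # Step 3: assign every instance to its closest cluster center.
        clusters = [[] for _ in range(k)]
        for p in points:
            closest = min(range(k), key=lambda i: math.dist(p, centers[i]))
            clusters[closest].append(p)
        # Step 4: recompute each cluster center as the mean of its members.
        new_centers = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else centers[i]
            for i, cluster in enumerate(clusters)
        ]
        # Step 5: stop once the cluster centers no longer change.
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters

# The six instances from Table 3.6.
points = [(1.0, 1.5), (1.0, 4.5), (2.0, 1.5), (2.0, 3.5), (3.0, 2.5), (5.0, 6.0)]
centers, clusters = k_means(points, k=2)
print(centers)
print(clusters)

With these initial centers the run converges to clusters {1, 3, 5} and {2, 4, 6}, which is the first outcome listed in Table 3.7 below.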
An Example Using K-Means
Table 3.6 • K-Means Input Values

Instance    X      Y
1           1.0    1.5
2           1.0    4.5
3           2.0    1.5
4           2.0    3.5
5           3.0    2.5
6           5.0    6.0
Figure 3.6 A coordinate mapping of the data in Table 3.6
Table 3.7 • Several Applications of the K-Means Algorithm (K = 2)

Outcome   Cluster Centers   Cluster Points   Squared Error
1         (2.67, 4.67)      2, 4, 6          14.50
          (2.00, 1.83)      1, 3, 5
2         (1.5, 1.5)        1, 3             15.94
          (2.75, 4.125)     2, 4, 5, 6
3         (1.8, 2.7)        1, 2, 3, 4, 5    9.60
          (5.0, 6.0)        6
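To see where the Squared Error column comes from, take outcome 3 as a worked example (treating squared error as the sum of squared Euclidean distances from each instance to its cluster center). Instances 1 through 5 belong to the center (1.8, 2.7), giving squared distances of 2.08, 3.88, 1.48, 0.68, and 1.48, while instance 6 sits exactly on its own center (5.0, 6.0). The total is 2.08 + 3.88 + 1.48 + 0.68 + 1.48 = 9.60, the lowest (best) of the three outcomes.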
Figure 3.7 A K-Means clustering of the data in Table 3.6 (K = 2)
General Considerations
• Requires real-valued data.
• We must select the number of clusters present in the
data.
• Works best when the clusters in the data are of
approximately equal size.
• Attribute significance cannot be determined.
• Lacks explanation capabilities.
3.4 Genetic Learning
Genetic Learning Operators
• Crossover
• Mutation
• Selection
Genetic Algorithms and
Supervised Learning
Figure 3.8 Supervised genetic learning
(Population elements are scored by a fitness function against the training data; elements that score well are kept and become candidates for crossover and mutation, while the rest are thrown out.)
Table 3.8 • An Initial Population for Supervised Genetic Learning

Population   Income    Life Insurance   Credit Card
Element      Range     Promotion        Insurance     Sex      Age
1            20–30K    No               Yes           Male     30–39
2            30–40K    Yes              No            Female   50–59
3            ?         No               No            Male     40–49
4            30–40K    Yes              Yes           Male     40–49
Table 3.9 • Training Data for Genetic Learning
Training
Income
Instance
1
2
3
4
5
6
A/W & Dr. Chen, Data Mining
Range
Life
Insurance
Promotion
Credit
Card
Insurance Sex
Age
30–40K
30–40K
50–60K
20–30K
20–30K
30–40K
Yes
Yes
Yes
No
No
No
Yes
No
No
No
No
No
30–39
40–49
30–39
50–59
20–29
40–49
Male
Female
Female
Female
Male
Male
Figure 3.9 A crossover operation
(attributes listed in order: Income Range, Life Insurance Promotion, Credit Card Insurance, Sex, Age)

Before crossover:
Element #1: 20–30K, No, Yes, Male, 30–39
Element #2: 30–40K, Yes, No, Female, 50–59

After crossover:
Element #2: 30–40K, Yes, Yes, Male, 30–39
Element #1: 20–30K, No, No, Female, 50–59
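The swap in Figure 3.9 can be pictured as single-point crossover on two attribute-value lists. The short Python sketch below is only an illustration; the crossover point is fixed after the second attribute to mirror the figure, and selection and mutation are omitted.

def crossover(parent1, parent2, point):
    # Single-point crossover: swap everything after `point` between two elements.
    child1 = parent1[:point] + parent2[point:]
    child2 = parent2[:point] + parent1[point:]
    return child1, child2

# Population elements #1 and #2 from Table 3.8:
# [Income Range, Life Insurance Promotion, Credit Card Insurance, Sex, Age]
element1 = ["20-30K", "No",  "Yes", "Male",   "30-39"]
element2 = ["30-40K", "Yes", "No",  "Female", "50-59"]

new2, new1 = crossover(element2, element1, point=2)
print(new2)   # ['30-40K', 'Yes', 'Yes', 'Male', '30-39']
print(new1)   # ['20-30K', 'No', 'No', 'Female', '50-59']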
Table 3.10 • A Second-Generation Population

Population   Income    Life Insurance   Credit Card
Element      Range     Promotion        Insurance     Sex      Age
1            20–30K    No               No            Female   50–59
2            30–40K    Yes              Yes           Male     30–39
3            ?         No               No            Male     40–49
4            30–40K    Yes              Yes           Male     40–49
Genetic Algorithms and
Unsupervised Clustering
Figure 3.10 Unsupervised genetic clustering
(P instances I1 ... Ip, each described by attributes a1, a2, ..., an, are evaluated against K candidate solutions S1 ... SK; each solution holds elements, e.g. E11 and E12 for S1, that serve as its cluster centers.)
Table 3.11 • A First-Generation Population for Unsupervised Clustering

     Solution elements      Fitness   Solution elements      Fitness   Solution elements      Fitness
     (initial population)   score     (second generation)    score     (third generation)     score
S1   (1.0,1.0) (5.0,5.0)    11.31     (5.0,1.0) (5.0,5.0)    17.96     (5.0,5.0) (1.0,5.0)    13.64
S2   (3.0,2.0) (3.0,5.0)     9.78     (3.0,2.0) (3.0,5.0)     9.78     (3.0,2.0) (3.0,5.0)     9.78
S3   (4.0,3.0) (5.0,1.0)    15.55     (4.0,3.0) (1.0,1.0)    11.34     (4.0,3.0) (1.0,1.0)    11.34
General Considerations
• Global optimization is not a guarantee.
• The fitness function determines the
complexity of the algorithm.
• Genetic algorithms can explain their results, provided the fitness function is understandable.
• Transforming the data to a form suitable for
genetic learning can be a challenge.
3.5 Choosing a Data Mining
Technique
Initial Considerations
• Is learning supervised or unsupervised?
• Is explanation required?
• What is the interaction between input and output
attributes?
• What are the data types of the input and output
attributes?
Further Considerations
• Do We Know the Distribution of the Data?
• Do We Know Which Attributes Best Define the Data?
• Does the Data Contain Missing Values?
• Is Time an Issue?
• Which Technique Is Most Likely to Give a Best Test
Set Accuracy?