Basic Data Mining Techniques

Chapter 3
3.1 Decision Trees
An Algorithm for Building Decision Trees

1. Let T be the set of training instances.
2. Choose an attribute that best differentiates the instances in T.
3. Create a tree node whose value is the chosen attribute.
   - Create child links from this node where each link represents a unique value for the chosen attribute.
   - Use the child link values to further subdivide the instances into subclasses.
4. For each subclass created in step 3:
   - If the instances in the subclass satisfy predefined criteria, or if the set of remaining attribute choices for this path is null, specify the classification for new instances following this decision path.
   - If the subclass does not satisfy the criteria and there is at least one attribute to further subdivide the path of the tree, let T be the current set of subclass instances and return to step 2.
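The loop above translates directly into code. The Python sketch below is a minimal illustration, not the chapter's implementation: it assumes purely categorical attributes (a numeric split such as Age <= 43 would need extra handling), uses majority-class accuracy within each branch as the "best differentiates" measure, and stops when a subclass is pure or no attributes remain, standing in for the "predefined criteria" of step 4.

from collections import Counter

def majority(labels):
    """Most common class label among a set of instances."""
    return Counter(labels).most_common(1)[0][0]

def attribute_score(instances, labels, attr):
    """Assumed goodness measure: training accuracy obtained by predicting
    the majority class within each branch created by splitting on attr."""
    branches = {}
    for inst, lab in zip(instances, labels):
        branches.setdefault(inst[attr], []).append(lab)
    correct = sum(Counter(b).most_common(1)[0][1] for b in branches.values())
    return correct / len(labels)

def build_tree(instances, labels, attributes):
    """Steps 2-4: pick the best attribute, create child links for each of its
    values, and recurse until a subclass is pure or no attributes remain."""
    if len(set(labels)) == 1 or not attributes:
        return majority(labels)                  # leaf: classification for this path
    best = max(attributes, key=lambda a: attribute_score(instances, labels, a))
    remaining = [a for a in attributes if a != best]
    partitions = {}
    for inst, lab in zip(instances, labels):
        partitions.setdefault(inst[best], ([], []))
        partitions[inst[best]][0].append(inst)
        partitions[inst[best]][1].append(lab)
    return {"attribute": best,
            "branches": {value: build_tree(sub_inst, sub_lab, remaining)
                         for value, (sub_inst, sub_lab) in partitions.items()}}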
Table 3.1 • The Credit Card Promotion Database

Income Range   Life Insurance Promotion   Credit Card Insurance   Sex      Age
40–50K         No                         No                      Male     45
30–40K         Yes                        No                      Female   40
40–50K         No                         No                      Male     42
30–40K         Yes                        Yes                     Male     43
50–60K         Yes                        No                      Female   38
20–30K         No                         No                      Female   55
30–40K         Yes                        Yes                     Male     35
20–30K         No                         No                      Male     27
30–40K         No                         No                      Male     43
30–40K         Yes                        No                      Female   41
40–50K         Yes                        No                      Female   43
20–30K         Yes                        No                      Male     29
50–60K         Yes                        No                      Female   39
40–50K         No                         No                      Male     55
20–30K         Yes                        Yes                     Female   19
Figure 3.1 A partial decision tree with root node = income range
(branches, with life insurance promotion counts: 20–30K → 2 Yes, 2 No; 30–40K → 4 Yes, 1 No; 40–50K → 1 Yes, 3 No; 50–60K → 2 Yes, 0 No)
Figure 3.2 A partial decision tree with root node = credit card insurance
(branches, with life insurance promotion counts: No → 6 Yes, 6 No; Yes → 3 Yes, 0 No)
Figure 3.3 A partial decision tree with root node = age
(branches, with life insurance promotion counts: <= 43 → 9 Yes, 3 No; > 43 → 0 Yes, 3 No)
Decision Trees for the Credit Card Promotion Database
Figure 3.4 A three-node decision tree for the credit card promotion database
(Age: > 43 → No (3/0); <= 43 → Sex: Female → Yes (6/0), Male → Credit Card Insurance: No → No (4/1), Yes → Yes (2/0))
Figure 3.5 A two-node decision tree for the credit card promotion database
(Credit Card Insurance: Yes → Yes (3/0); No → Sex: Female → Yes (6/1), Male → No (6/1))
Table 3.2 • Training Data Instances Following the Path in Figure 3.4 to Credit Card Insurance = No

Income Range   Life Insurance Promotion   Credit Card Insurance   Sex    Age
40–50K         No                         No                      Male   42
20–30K         No                         No                      Male   27
30–40K         No                         No                      Male   43
20–30K         Yes                        No                      Male   29
Decision Tree Rules
A Rule for the Tree in Figure 3.4
• IF Age <= 43 & Sex = Male & Credit Card Insurance = No
  THEN Life Insurance Promotion = No

A Simplified Rule Obtained by Removing Attribute Age
• IF Sex = Male & Credit Card Insurance = No
  THEN Life Insurance Promotion = No
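Read as code, either rule is just a conjunction of attribute tests. A minimal sketch, assuming instances are dictionaries keyed by attribute name (the representation is mine, not the book's):

def simplified_rule(instance):
    """The simplified rule above: IF Sex = Male & Credit Card Insurance = No
    THEN Life Insurance Promotion = No; returns None when the rule does not fire."""
    if instance["Sex"] == "Male" and instance["Credit Card Insurance"] == "No":
        return "No"        # predicted value of Life Insurance Promotion
    return None

print(simplified_rule({"Sex": "Male", "Credit Card Insurance": "No", "Age": 42}))   # -> No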
Other Methods for Building Decision Trees
• CART
• CHAID
Advantages of Decision Trees
• Easy to understand.
• Map nicely to a set of production rules.
• Applied to real problems.
• Make no prior assumptions about the data.
• Able to process both numerical and categorical data.
Disadvantages of Decision Trees
• Output attribute must be categorical.
• Limited to one output attribute.
• Decision tree algorithms are unstable.
• Trees created from numeric datasets
can be complex.
3.2 Generating Association Rules
Confidence and Support

Rule Confidence
• Given a rule of the form “If A then B”, rule confidence is the conditional probability that B is true when A is known to be true.

Rule Support
• The minimum percentage of instances in the database that contain all items listed in a given association rule.
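Both measures reduce to counting instances. Below is a minimal Python sketch, assuming each instance is a dictionary of attribute-value pairs (the representation is mine). Applied to the ten instances of Table 3.3 below, the rule "IF Magazine Promotion = Yes THEN Life Insurance Promotion = Yes" has confidence 5/7 (about 71%) and support 5/10 = 50%.

def matches(instance, condition):
    """True if the instance has every attribute = value pair in condition."""
    return all(instance.get(attr) == val for attr, val in condition.items())

def support(data, antecedent, consequent):
    """Percentage of instances that contain all items listed in the rule."""
    both = {**antecedent, **consequent}
    return 100.0 * sum(matches(i, both) for i in data) / len(data)

def confidence(data, antecedent, consequent):
    """Conditional probability that the consequent holds when the antecedent does."""
    covered = [i for i in data if matches(i, antecedent)]
    return 100.0 * sum(matches(i, consequent) for i in covered) / len(covered) if covered else 0.0

# Example with the rule above (data would hold the Table 3.3 instances):
# confidence(data, {"Magazine Promotion": "Yes"}, {"Life Insurance Promotion": "Yes"})  # ~71.4
# support(data,    {"Magazine Promotion": "Yes"}, {"Life Insurance Promotion": "Yes"})  # 50.0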
Mining Association Rules: An Example
Table 3.3 • A Subset of the Credit Card Promotion Database

Magazine Promotion   Watch Promotion   Life Insurance Promotion   Credit Card Insurance   Sex
Yes                  No                No                         No                      Male
Yes                  Yes               Yes                        No                      Female
No                   No                No                         No                      Male
Yes                  Yes               Yes                        Yes                     Male
Yes                  No                Yes                        No                      Female
No                   No                No                         No                      Female
Yes                  No                Yes                        Yes                     Male
No                   Yes               No                         No                      Male
Yes                  No                No                         No                      Male
Yes                  Yes               Yes                        No                      Female
Table 3.4 • Single-Item Sets

Single-Item Sets                 Number of Items
Magazine Promotion = Yes         7
Watch Promotion = Yes            4
Watch Promotion = No             6
Life Insurance Promotion = Yes   5
Life Insurance Promotion = No    5
Credit Card Insurance = No       8
Sex = Male                       6
Sex = Female                     4
Table 3.5 • Two-Item Sets

Two-Item Sets                                                Number of Items
Magazine Promotion = Yes & Watch Promotion = No              4
Magazine Promotion = Yes & Life Insurance Promotion = Yes    5
Magazine Promotion = Yes & Credit Card Insurance = No        5
Magazine Promotion = Yes & Sex = Male                        4
Watch Promotion = No & Life Insurance Promotion = No         4
Watch Promotion = No & Credit Card Insurance = No            5
Watch Promotion = No & Sex = Male                            4
Life Insurance Promotion = No & Credit Card Insurance = No   5
Life Insurance Promotion = No & Sex = Male                   4
Credit Card Insurance = No & Sex = Male                      4
Credit Card Insurance = No & Sex = Female                    4
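Tables 3.4 and 3.5 can be reproduced by straightforward counting: tally every attribute = value item, keep those covered by enough instances, then pair the survivors and count again. The sketch below assumes a minimum item-set coverage of four instances, which is the threshold the two tables appear to use, and the same dictionary representation as the earlier sketch.

from itertools import combinations

def single_item_sets(data, min_count=4):
    """Count every attribute = value item and keep those covering at least
    min_count instances (reproduces Table 3.4)."""
    counts = {}
    for instance in data:
        for attr, val in instance.items():
            counts[(attr, val)] = counts.get((attr, val), 0) + 1
    return {item: n for item, n in counts.items() if n >= min_count}

def two_item_sets(data, singles, min_count=4):
    """Pair surviving single items on different attributes and keep pairs
    covering at least min_count instances (reproduces Table 3.5)."""
    result = {}
    for (a1, v1), (a2, v2) in combinations(singles, 2):
        if a1 == a2:
            continue                      # skip pairs built from the same attribute
        n = sum(1 for i in data if i.get(a1) == v1 and i.get(a2) == v2)
        if n >= min_count:
            result[((a1, v1), (a2, v2))] = n
    return result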
General Considerations
• We are interested in association rules that
show a lift in product sales where the lift
is the result of the product’s association
with one or more other products.
• We are also interested in association rules
that show a lower than expected
confidence for a particular association.
3.3 The K-Means Algorithm

1. Choose a value for K, the total number of clusters.
2. Randomly choose K points as cluster centers.
3. Assign the remaining instances to their closest cluster center.
4. Calculate a new cluster center for each cluster.
5. Repeat steps 3-5 until the cluster centers do not change.
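The five steps can be written out in a few lines. The Python sketch below is an illustration only, assuming two-dimensional points and Euclidean distance as in the example that follows; the fixed random seed is just to make the run repeatable.

import math
import random

def kmeans(points, k, seed=0):
    """Steps 1-5: choose k centers at random, assign each point to its closest
    center, recompute each center as its cluster mean, repeat until stable."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)                          # step 2
    while True:
        clusters = [[] for _ in range(k)]
        for p in points:                                     # step 3
            nearest = min(range(k), key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        new_centers = [tuple(sum(c) / len(cluster) for c in zip(*cluster))
                       if cluster else centers[i]            # step 4
                       for i, cluster in enumerate(clusters)]
        if new_centers == centers:                           # step 5
            return centers, clusters
        centers = new_centers

# The six instances of Table 3.6 (listed below):
points = [(1.0, 1.5), (1.0, 4.5), (2.0, 1.5), (2.0, 3.5), (3.0, 2.5), (5.0, 6.0)]
print(kmeans(points, k=2))

Because step 2 chooses the initial centers at random, different runs can settle on different clusterings, which is exactly what Table 3.7 illustrates.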
An Example Using K-Means
Table 3.6 • K-Means Input Values

Instance   X     Y
1          1.0   1.5
2          1.0   4.5
3          2.0   1.5
4          2.0   3.5
5          3.0   2.5
6          5.0   6.0
Figure 3.6 A coordinate mapping of the data in Table 3.6
Table 3.7 • Several Applications of the K-Means Algorithm (K = 2)

Outcome   Cluster Centers   Cluster Points    Squared Error
1         (2.67, 4.67)      2, 4, 6           14.50
          (2.00, 1.83)      1, 3, 5
2         (1.5, 1.5)        1, 3              15.94
          (2.75, 4.125)     2, 4, 5, 6
3         (1.8, 2.7)        1, 2, 3, 4, 5     9.60
          (5.0, 6.0)        6
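The squared-error column can be checked in a few lines, assuming (my reading) that squared error means the sum of squared Euclidean distances from each instance to its assigned cluster center; under that reading the outcome 3 configuration reproduces the reported value of 9.60.

points = {1: (1.0, 1.5), 2: (1.0, 4.5), 3: (2.0, 1.5),
          4: (2.0, 3.5), 5: (3.0, 2.5), 6: (5.0, 6.0)}      # Table 3.6

def squared_error(assignment):
    """Sum of squared distances from each instance to its cluster center."""
    total = 0.0
    for center, members in assignment:
        for m in members:
            x, y = points[m]
            total += (x - center[0]) ** 2 + (y - center[1]) ** 2
    return total

# Outcome 3 of Table 3.7: centers (1.8, 2.7) and (5.0, 6.0)
outcome3 = [((1.8, 2.7), [1, 2, 3, 4, 5]), ((5.0, 6.0), [6])]
print(round(squared_error(outcome3), 2))                    # -> 9.6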
Figure 3.7 A K-Means clustering of the data in Table 3.6 (K = 2)
General Considerations
• Requires real-valued data.
• We must select the number of clusters
present in the data.
• Works best when the clusters in the data
are of approximately equal size.
• Attribute significance cannot be determined.
• Lacks explanation capabilities.
3.4 Genetic Learning
Genetic Learning Operators
• Selection
• Crossover
• Mutation
Genetic Algorithms and Supervised Learning

Figure 3.8 Supervised genetic learning
(population elements are evaluated by a fitness function against the training data; elements are either kept or thrown out, and the kept elements become candidates for crossover and mutation)
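Figure 3.8's loop can also be sketched in code. The fragment below is a simplified illustration, not the book's procedure: population elements and training instances are dictionaries shaped like Tables 3.8 and 3.9, the fitness measure (the number of same-class training instances matched on every non-wildcard attribute) is my own stand-in, crossover exchanges the attribute values after a fixed point as in Figure 3.9, and mutation simply overwrites one attribute with a wildcard.

import random

ATTRS = ["Income Range", "Credit Card Insurance", "Sex", "Age"]   # input attributes
rng = random.Random(1)

def fitness(element, training_data):
    """Assumed fitness: count the training instances whose class agrees with the
    element's Life Insurance Promotion value and whose input attributes match the
    element wherever the element is not the wildcard '?'."""
    return sum(1 for inst in training_data
               if inst["Life Insurance Promotion"] == element["Life Insurance Promotion"]
               and all(element[a] in ("?", inst[a]) for a in ATTRS))

def crossover(e1, e2, point=1):
    """Exchange the attribute values after the crossover point (cf. Figure 3.9)."""
    c1, c2 = dict(e1), dict(e2)
    for a in ATTRS[point:]:
        c1[a], c2[a] = e2[a], e1[a]
    return c1, c2

def mutate(element):
    """Randomly overwrite one input attribute with the wildcard '?'."""
    child = dict(element)
    child[rng.choice(ATTRS)] = "?"
    return child

def next_generation(population, training_data, keep=2):
    """One pass around Figure 3.8: score every element, keep the fittest,
    and refill the population with crossover and mutation of the survivors."""
    survivors = sorted(population, key=lambda e: fitness(e, training_data),
                       reverse=True)[:keep]
    children = list(crossover(survivors[0], survivors[1]))
    while len(survivors) + len(children) < len(population):
        children.append(mutate(rng.choice(survivors)))
    return survivors + children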
Table 3.8 • An Initial Population for Supervised Genetic Learning

Population Element   Income Range   Life Insurance Promotion   Credit Card Insurance   Sex      Age
1                    20–30K         No                         Yes                     Male     30–39
2                    30–40K         Yes                        No                      Female   50–59
3                    ?              No                         No                      Male     40–49
4                    30–40K         Yes                        Yes                     Male     40–49
Table 3.9 • Training Data for Genetic Learning

Training Instance   Income Range   Life Insurance Promotion   Credit Card Insurance   Sex      Age
1                   30–40K         Yes                        Yes                     Male     30–39
2                   30–40K         Yes                        No                      Female   40–49
3                   50–60K         Yes                        No                      Female   30–39
4                   20–30K         No                         No                      Female   50–59
5                   20–30K         No                         No                      Male     20–29
6                   30–40K         No                         No                      Male     40–49
Figure 3.9 A crossover operation
(attribute order: Income Range, Life Insurance Promotion, Credit Card Insurance, Sex, Age)

Before crossover:
#1: 20–30K, No, Yes, Male, 30–39
#2: 30–40K, Yes, No, Female, 50–59

After crossover, the two elements exchange their Credit Card Insurance, Sex, and Age values:
#1: 20–30K, No, No, Female, 50–59
#2: 30–40K, Yes, Yes, Male, 30–39
Table 3.10 • A Second-Generation Population

Population Element   Income Range   Life Insurance Promotion   Credit Card Insurance   Sex      Age
1                    20–30K         No                         No                      Female   50–59
2                    30–40K         Yes                        Yes                     Male     30–39
3                    ?              No                         No                      Male     40–49
4                    30–40K         Yes                        Yes                     Male     40–49
Genetic Algorithms and Unsupervised Clustering

Figure 3.10 Unsupervised genetic clustering
(P instances I1 … Ip, each described by attributes a1 … an, are presented to K candidate solutions S1 … SK; each solution Sk holds cluster-center elements Ek1, Ek2, …)
Table 3.11 • A First-Generation Population for Unsupervised Clustering

                                         S1                    S2                    S3
Solution elements (initial population)   (1.0,1.0) (5.0,5.0)   (3.0,2.0) (3.0,5.0)   (4.0,3.0) (5.0,1.0)
Fitness score                            11.31                 9.78                  15.55
Solution elements (second generation)    (5.0,1.0) (5.0,5.0)   (3.0,2.0) (3.0,5.0)   (4.0,3.0) (1.0,1.0)
Fitness score                            17.96                 9.78                  11.34
Solution elements (third generation)     (5.0,5.0) (1.0,5.0)   (3.0,2.0) (3.0,5.0)   (4.0,3.0) (1.0,1.0)
Fitness score                            13.64                 9.78                  11.34
General Considerations
• Global optimization is not guaranteed.
• The fitness function determines the complexity of the algorithm.
• Genetic algorithms can explain their results provided the fitness function is understandable.
• Transforming the data to a form suitable
for genetic learning can be a challenge.
3.5 Choosing a Data Mining Technique
Initial Considerations
• Is learning supervised or unsupervised?
• Is explanation required?
• What is the interaction between input and
output attributes?
• What are the data types of the input and
output attributes?
Further Considerations
• Do We Know the Distribution of the Data?
• Do We Know Which Attributes Best Define the Data?
• Does the Data Contain Missing Values?
• Is Time an Issue?
• Which Technique Is Most Likely to Give the Best Test Set Accuracy?