Data Mining: A Tutorial


An Excel-based Data Mining Tool
Chapter 4
4.1 The iData Analyzer
Figure 4.1 The iDA system architecture: the interface feeds data to the preprocessor (large datasets pass through a heuristic agent first); mining is performed by ESX when an explanation is needed, otherwise by a neural network; when rules are to be generated, RuleMaker runs on ESX's output; the report generator writes all results, including rules, to Excel sheets
Figure 4.2 A successful installation
4.2 ESX: A Multipurpose Tool for Data Mining
Figure 4.3 An ESX concept hierarchy: a three-level tree with the root at the top, concept nodes C1 through Cn at the concept level, and each concept's instances (I11, I12, ..., Inl) at the instance level
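As a rough sketch only (not ESX's internal representation), the three-level hierarchy can be modeled as a root whose children are concept nodes, each holding its instances:

    class ConceptNode:
        def __init__(self, name):
            self.name = name       # a concept-level node, e.g. "C1"
            self.instances = []    # instance-level leaves, e.g. I11, I12, ...

    class ConceptTree:
        def __init__(self):
            self.concepts = []     # concept-level children of the root

        def add_instance(self, concept_name, instance):
            # File an instance under its concept, creating the node if needed.
            for node in self.concepts:
                if node.name == concept_name:
                    node.instances.append(instance)
                    return
            node = ConceptNode(concept_name)
            node.instances.append(instance)
            self.concepts.append(node)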
4.3 iDAV Format for Data Mining
Table 4.1 • Credit Card Promotion Database: iDAV Format

Income   Magazine   Watch      Life Insurance  Credit Card
Range    Promotion  Promotion  Promotion       Insurance    Sex     Age
C I      C I        C I        C I             C I          C I     R I
40–50K   Yes        No         No              No           Male    45
30–40K   Yes        Yes        Yes             No           Female  40
40–50K   No         No         No              No           Male    42
30–40K   Yes        Yes        Yes             Yes          Male    43
50–60K   Yes        No         Yes             No           Female  38
20–30K   No         No         No              No           Female  55
30–40K   Yes        No         Yes             Yes          Male    35
20–30K   No         Yes        No              No           Male    27
30–40K   Yes        No         No              No           Male    43
30–40K   Yes        Yes        Yes             No           Female  41
40–50K   No         Yes        Yes             No           Female  43
20–30K   No         Yes        Yes             No           Male    29
50–60K   Yes        Yes        Yes             No           Female  39
40–50K   No         Yes        No              No           Male    55
20–30K   No         No         Yes             Yes          Female  19

(Each attribute name is followed by two iDAV header codes: a type, C for categorical or R for real-valued, and a usage code from Table 4.2; here every attribute is an input, I.)
Table 4.2 • Values for Attribute Usage

Character  Usage
I          The attribute is used as an input attribute.
U          The attribute is not used.
D          The attribute is not used for classification or clustering,
           but attribute value summary information is displayed in all
           output reports.
O          The attribute is used as an output attribute. For supervised
           learning with ESX, exactly one categorical attribute is
           selected as the output attribute.
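Putting Tables 4.1 and 4.2 together, a hypothetical in-memory rendering of the iDAV header rows might look as follows (the names are illustrative only, not part of iDA's API):

    # Hypothetical header for Table 4.1: each attribute carries a type code
    # (C = categorical, R = real-valued) and a usage code from Table 4.2.
    attributes = {
        "Income Range":             ("C", "I"),
        "Magazine Promotion":       ("C", "I"),
        "Watch Promotion":          ("C", "I"),
        "Life Insurance Promotion": ("C", "I"),
        "Credit Card Insurance":    ("C", "I"),
        "Sex":                      ("C", "I"),
        "Age":                      ("R", "I"),
    }

    # Select the attributes a mining session will use as inputs.
    inputs = [name for name, (_, usage) in attributes.items() if usage == "I"]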
4.4 A Five-Step Approach for Unsupervised Clustering
Step 1: Enter the Data to be Mined
Step 2: Perform a Data Mining Session
Step 3: Read and Interpret Summary Results
Step 4: Read and Interpret Individual Class Results
Step 5: Visualize Individual Class Rules
Step 1: Enter the Data to be Mined
Figure 4.4 The Credit Card Promotion Database
Step 2: Perform a Data Mining Session
Figure 4.5 Unsupervised settings for ESX
Figure 4.6 RuleMaker options
Step 3: Read and Interpret Summary Results
• Class Resemblance Scores
• Domain Resemblance Score
• Domain Predictability
Summary Results
• Class Resemblance Score offers a first indication of how well the instances within each class (cluster) fit together.
• Domain Resemblance Score represents the overall similarity of all instances within the data set.
• It is highly desirable that class resemblance scores be higher than the domain resemblance score, as the sketch below illustrates.
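ESX's actual resemblance computation is more involved than shown here, but a simplified stand-in that averages the pairwise fraction of matching attribute values captures the idea:

    from itertools import combinations

    def similarity(a, b):
        # Simplified similarity: the fraction of attribute values two
        # instances share (a stand-in for ESX's own measure).
        return sum(x == y for x, y in zip(a, b)) / len(a)

    def resemblance(instances):
        # Average pairwise similarity over a set of instances.
        pairs = list(combinations(instances, 2))
        return sum(similarity(a, b) for a, b in pairs) / len(pairs)

    # A clustering looks promising when, for every cluster,
    # resemblance(cluster_instances) > resemblance(all_instances).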
Summary Results
• Given categorical attribute A with values v1, v2, v3, …, vi, …, vn, the Domain Predictability of vi tells us the percent of domain instances showing vi as a value for A.
• A predictability score near 100% for a domain-level categorical attribute value indicates that the attribute is not likely to be useful for supervised learning or unsupervised clustering.
Summary Results
• Given categorical attribute A with values v1, v2, v3, …, vi, …, vn, the Class C Predictability score for vi tells us the percent of instances within class C showing vi as a value for A.
• Given class C and categorical attribute A with values v1, v2, v3, …, vi, …, vn, an Attribute-Value Predictiveness score for vi is defined as the probability an instance resides in C given the instance has value vi for A. Both scores are transcribed in the sketch below.
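Both scores are conditional frequencies and transcribe directly into code. This sketch assumes instances stored as dictionaries carrying a "class" key alongside their attribute values; Domain Predictability is the same frequency computed over all instances rather than one class:

    def class_predictability(instances, cls, attr, value):
        # P(A = value | class = cls): the percent of class cls instances
        # showing this value for attribute attr.
        members = [i for i in instances if i["class"] == cls]
        return sum(i[attr] == value for i in members) / len(members)

    def predictiveness(instances, cls, attr, value):
        # P(class = cls | A = value): the probability an instance resides
        # in cls given that it shows this value for attribute attr.
        matching = [i for i in instances if i[attr] == value]
        return sum(i["class"] == cls for i in matching) / len(matching)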
Domain Statistics for Numerical Attributes
• Attribute Significance Value measures the predictive value of each numerical attribute.
• To calculate the Attribute Significance Value for a numeric attribute: a) subtract the smallest class mean from the largest class mean; b) divide this result by the domain standard deviation (see the sketch below).
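A direct transcription of steps a) and b); the example uses the Age values from Table 4.1 under an arbitrary two-way split into hypothetical clusters, for illustration only:

    from statistics import mean, stdev

    def attribute_significance(class_value_lists, domain_values):
        # (largest class mean - smallest class mean) / domain std deviation
        class_means = [mean(values) for values in class_value_lists]
        return (max(class_means) - min(class_means)) / stdev(domain_values)

    # Hypothetical split of the Age attribute across two clusters.
    ages_c1 = [45, 42, 43, 55, 43, 43, 55]
    ages_c2 = [40, 38, 35, 27, 41, 29, 39, 19]
    print(attribute_significance([ages_c1, ages_c2], ages_c1 + ages_c2))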
Figure 4.8 Summary statistics for the Acme credit card promotion database
Figure 4.9 Statistics for numerical attributes and common categorical attribute values
Step 4: Read and Interpret Individual Class Results
• Class Predictability is a within-class measure.
• Class Predictiveness is a between-class measure.
Necessary and Sufficient Attribute Values
• If an attribute value has a predictability and
predictiveness score of 1.0, the attribute
value is said to be necessary and sufficient
for membership in class C. That is, all
instances within class C have the specified
value for the attribute and all instances with
this value for the attribute reside in class C.
Sufficient Attribute Values
• If an attribute value has a predictiveness
score of 1.0 and a predictability score less
than 1.0, the attribute value is said to be
sufficient but not necessary for membership
in class C. That is, all instances with the
value for the attribute reside in C, but there
are other instances in C that have a
different value for this attribute.
Necessary Attribute Values
• If an attribute value has a predictability
score of 1.0 and a predictiveness score less
than 1.0, the attribute value is said to be
necessary but not sufficient for membership
in class C. That is, all instances in C have
the same value for the attribute, but there
are other instances outside C that have the
same value for this attribute.
Necessary and Sufficient Attribute Values in iDA
• Attribute values with predictiveness scores greater than or equal to 0.8 are considered highly sufficient.
• Attribute values with predictability scores greater than or equal to 0.8 are considered necessary. The sketch below applies both cutoffs.
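A one-function summary of these cutoffs (the 0.8 threshold is iDA's; the function itself is ours):

    def characterize(predictability, predictiveness, threshold=0.8):
        # Apply iDA's 0.8 cutoffs to one attribute value's scores for a class.
        necessary = predictability >= threshold
        sufficient = predictiveness >= threshold
        if necessary and sufficient:
            return "necessary and sufficient"
        if sufficient:
            return "highly sufficient"
        if necessary:
            return "necessary"
        return "neither"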
Figure 4.10 Class 3 summary results
Figure 4.11 Necessary and sufficient attribute values for Class 3
Step 5: Visualize Individual Class Rules
Figure 4.7 Rules for the credit card promotion database
Rule Interpretation in iDA
• Each rule simply declares the precondition(s) necessary for an instance to be covered by the rule:
  if [(condition & condition & … & condition) = true] then an instance resides in a certain class.
Rule Interpretation in iDA
• Rule accuracy tells us that the rule is accurate in …% of all cases where it applies.
• Rule coverage tells us that the rule applies to …% of class instances. The sketch below computes both measures.
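Both measures follow directly from counting. A sketch, again assuming dictionary instances with a "class" key and the rule's precondition supplied as a function:

    def rule_accuracy_and_coverage(instances, precondition, cls):
        # covered: instances where the rule's precondition(s) hold.
        covered = [i for i in instances if precondition(i)]
        members = [i for i in instances if i["class"] == cls]
        correct = [i for i in covered if i["class"] == cls]
        accuracy = len(correct) / len(covered)  # % of applicable cases right
        coverage = len(correct) / len(members)  # % of class instances covered
        return accuracy, coverage

    # Hypothetical rule: IF Sex = Female THEN class = C1
    # precondition = lambda i: i["Sex"] == "Female"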
4.5 A Six-Step Approach for Supervised Learning
Step 1: Choose an Output Attribute
Step 2: Perform the Mining Session
Step 3: Read and Interpret Summary Results
Step 4: Read and Interpret Test Set Results
Step 5: Read and Interpret Class Results
Step 6: Visualize and Interpret Class Rules
Read and Interpret Test Set Results
Figure 4.12 Test set instance classification
4.6 Techniques for Generating Rules
1. Define the scope of the rules.
2. Choose the instances.
3. Set the minimum rule correctness.
4. Define the minimum rule coverage.
5. Choose an attribute significance value.
4.7 Instance Typicality
Typicality Scores
• Identify prototypical and outlier instances.
• Select a best set of training instances.
• Compute individual instance classification confidence scores (see the sketch below).
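The text does not spell out the typicality formula here; a common reading, sketched below, takes an instance's typicality to be its average similarity to the other members of its class (any similarity function, such as the matching-fraction stand-in sketched in Step 3, will do):

    def typicality(instance, class_members, similarity):
        # Average similarity of an instance to every other member of its
        # class: prototypical instances score high; outliers score low.
        others = [m for m in class_members if m is not instance]
        return sum(similarity(instance, m) for m in others) / len(others)

Ranking a class by this score surfaces prototypes at the top and outliers at the bottom, which supports both training-set selection and confidence scoring.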
Figure 4.13 Instance typicality
4.8 Special Considerations and Features
• Avoid Mining Delays
• The Quick Mine Feature
• Erroneous and Missing Data