Mining Class Comparisons

Download Report

Transcript Mining Class Comparisons

Concept Description and
Data Generalization
(baseado nos slides do livro: Data
Mining: C & T)
Two categories of data
mining
Descriptive mining: describes concepts or taskrelevant data sets in concise, summarative,
informative, discriminative forms
Predictive mining: Based on data and analysis,
constructs models for the database, and predicts
the trend and properties of unknown data
2003/04
Sistemas de Apoio à Decisão
(LEIC Tagus)
What is Concept Description?
Concept description (or class description):
generates descriptions for characterization and
comparison of data

Characterization: provides a concise and succinct
summarization of the given collection of data

Class comparison (or discrimination): provides
descriptions comparing two or more collections of
data
2003/04
Sistemas de Apoio à Decisão
(LEIC Tagus)
Data Generalization

A process which abstracts a large set of taskrelevant data in a database from a low
conceptual levels to higher ones.
1
2
3
4
5
Conceptual levels
Approaches:
• Data cube approach(OLAP approach)
• Attribute-oriented induction approach
2003/04
Sistemas de Apoio à Decisão
(LEIC Tagus)
Concept Description
vs OLAP
Similarities:

Data generalization

Presentation of data summarization at multiple levels of abstraction.

Interactive drilling, pivoting, slicing and dicing.
Differences:

Complex data types of the attributes and their aggregations

Automated process to find relevant attributes and generalization degree

Dimension relevance analysis and ranking when there are many relevant
dimensions.
2003/04
Sistemas de Apoio à Decisão
(LEIC Tagus)
Attribute-Oriented Induction

Proposed in 1989 (KDD ‘89 workshop)

Not confined to categorical data nor particular measures.

How it is done?

Collect the task-relevant data (initial relation) using a relational
database query

Perform data generalization by attribute removal or attribute
generalization, based on the nb. of distinct values of each
attribute.

Apply aggregation by merging identical, generalized tuples and
accumulating their respective counts

Interactive presentation with users
2003/04
Sistemas de Apoio à Decisão
(LEIC Tagus)
Basic Principles (1)
Data focusing: task-relevant data, including dimensions,
and the result is the initial (working) relation.
Attribute-removal: remove attribute A if there is a large
set of distinct values for A but:
(1) there is no generalization operator on A, or
(2) A’s higher level concepts are expressed in terms of other
attributes.
Attribute-generalization: If there is a large set of distinct
values for A, and there exists a set of generalization
operators on A, then select an operator and generalize
Sistemas de Apoio à Decisão
(LEIC Tagus)
2003/04
A.
Basic Principles (2)

Two methods to control a generalization process:
Attribute-threshold control: typical 2-8, specified/default

if the number of distinct values in an attribute is greater than
the att. threshold, then removal or generalization applies
Generalized relation threshold control: sets a threshold
for the generalized (final) relation/rule size

If the number of distinct tuples in the generalized relation is
greater than the threshold, then further generalization
applies
2003/04
Sistemas de Apoio à Decisão
(LEIC Tagus)
Basic Principles (3)
Acummulate count or other aggregate values : to
provide statistical information about the data at diff.
levels of abstraction
Ex: Count value for a tuple in the initial relation is 1,
When generalizing data, n tuples in the initial relation
result in groups of identical tuples merged into a single
generalized tuple (count is n)
2003/04
Sistemas de Apoio à Decisão
(LEIC Tagus)
Basic Algorithm
1.
InitialRel: Query processing of task-relevant data, deriving
the initial relation.
2.
PreGen: Based on the analysis of the number of distinct
values in each attribute, determine generalization plan for
each attribute: removal? or how high to generalize?
3.
PrimeGen: Based on the PreGen plan, perform
generalization to the right level to derive a “prime
generalized relation”, accumulating the counts.
4.
Presentation: User interaction: (1) adjust levels by drilling,
(2) pivoting, (3) mapping into rules, cross tabs,
visualization presentations.
2003/04
Sistemas de Apoio à Decisão
(LEIC Tagus)
Class Characterization:
Example (1)

Describe general characteristics of graduate students
in the Big-University database (in DMQL)
use Big_University_DB
mine characteristics as “Science_Students”
in relevance to name, gender, major, birth_place, birth_date,
residence, phone#, gpa
from student
where status in “graduate”

Corresponding SQL statement:
select name, gender, major, birth_place, birth_date, residence, phone#,
gpa
from student
where status in {“Msc”, “MBA”, “PhD” }
2003/04
Sistemas de Apoio à Decisão
(LEIC Tagus)
Class Characterization: An
Example (2)
Initial Relation
Name
Gender
Jim
Woodman
Scott
Lachance
Laura Lee
…
F
…
Removed
Retained
2003/04
M
M
Major
Birth-Place
Birth_date
Residence
Phone #
GPA
Vancouver,BC, 8-12-76
Canada
CS
Montreal, Que, 28-7-75
Canada
Physics Seattle, WA, USA 25-8-70
…
…
…
3511 Main St.,
Richmond
345 1st Ave.,
Richmond
687-4598
3.67
253-9106
3.70
125 Austin Ave.,
Burnaby
…
420-5232
…
3.83
…
Sci,Eng,
Bus
City
Removed
Excl,
VG,..
CS
Country
Age range
Sistemas de Apoio à Decisão
(LEIC Tagus)
Class Characterization: An
Example (3)
Prime
Generalized
Relation
Gender
M
F
…
Cross-tab
2003/04
Major
Birth_region Age_range Residence
Science
Canada
20-25
Richmond
Science
Foreign
25-30
Burnaby
…
…
…
…
GPA
Count
Very-good
16
Excellent
22
…
…
Birth_Region
Canada
Foreign
Total
Gender
M
16
14
30
F
10
22
32
Total
26
36
62
Sistemas de Apoio à Decisão
(LEIC Tagus)
Presentation of
Generalized Results (1)
Generalized relation:

Relations where some or all attributes are
generalized, with counts or other aggregation
values accumulated.
Cross tabulation:

Mapping results into cross tabulation form
(similar to contingency tables).

Visualization techniques: Pie charts, bar charts,
curves, cubes, and other visual forms.
2003/04
Sistemas de Apoio à Decisão
(LEIC Tagus)
Presentation—
Generalized Relation
2003/04
Sistemas de Apoio à Decisão
(LEIC Tagus)
Presentation—Crosstab
2003/04
Sistemas de Apoio à Decisão
(LEIC Tagus)
Presentation of Generalized
Results (2)

A generalized relation may also be represented in the
form of logic rules
Cj = target class
qa = a generalized tuple describing the target class
t-weight for qa: percentage of tuples of the target class
from the initial working class that are covered by qa
range: [0, 1]
t  weight 
count(q a )
n
 count(q
i 1
2003/04
Sistemas de Apoio à Decisão
(LEIC Tagus)
i
)
Presentation of Generalized
Results (3)
Quantitative characteristic rules: Mapping generalized
result into characteristic rules with quantitative
information associated with it
grad( x)  male( x) 
birth_ region( x) "Canada"[t :53%] birth_ region( x) " foreign"[t : 47%].

The disjunction of the conditions forms a necessary
condition of the target class, i.e., all tuples of the target
class must satisfy the condition

Not a sufficient condition of the target class, since a
tuple satisfying the same condition could belong to
another class
2003/04
Sistemas de Apoio à Decisão
(LEIC Tagus)
Attribute Relevance
Analysis (1)

Why?





Which dimensions should be included?
How high level of generalization?
Automatic vs. interactive
Reduce # attributes; easy to understand patterns
What?

statistical method for preprocessing data




2003/04
filter out irrelevant or weakly relevant attributes
retain or rank the relevant attributes
relevance related to dimensions and levels
analytical characterization, analytical comparison
Sistemas de Apoio à Decisão
(LEIC Tagus)
Attribute relevance
analysis (2)

How?
1.
2.
3.
Data Collection
Preliminary relevance analysis using conservative
AOI
Analytical Generalization


4.
Attribute-oriented Induction for class description

2003/04
Use information gain analysis (e.g., entropy or other
measures) to identify highly relevant dimensions and
levels.
Sort and select the most relevant dimensions and levels.
Using a less conservative threshold for AOI
Sistemas de Apoio à Decisão
(LEIC Tagus)
Relevance Measures
Quantitative relevance measure:
determines the classifying power of an
attribute within a set of data.
Methods:
information gain (ID3)
 gain ratio (C4.5)
 gini index
 2 contingency table statistics
 uncertainty coefficient

2003/04
Sistemas de Apoio à Decisão
(LEIC Tagus)
Entropy and Information
Gain
S contains si tuples of class Ci for i = {1, …, m}
Entropy or expected information measures info
required to classify any arbitrary tuple
m
E ( S )  I( s1,s2,...,sm)  
i 1
si
si
log 2
s
s
Entropy of attribute A with values {a1,a2,…,av}
s1 j  ...  smj
I (s1 j,..., smj)
s
j 1
v
E(A)  
Information gained by branching on attribute A
Gain(A)  I(s 1, s 2 ,..., sm)  E(A)
2003/04
Sistemas de Apoio à Decisão
(LEIC Tagus)
Example of Analytical
Characterization (1)
Task

Mine general characteristics describing graduate
students using analytical characterization
Given





2003/04
attributes name, gender, major, birth_place,
birth_date, phone#, and gpa
Gen(ai) = concept hierarchies on ai
Ui = attribute analytical thresholds for ai
Ti = attribute generalization thresholds for ai
R = attribute relevance threshold
Sistemas de Apoio à Decisão
(LEIC Tagus)
Example of Analytical
Characterization (2)
1.
Data collection


2.
target class: graduate student
contrasting class: undergraduate student
Analytical generalization using Ui

attribute removal


attribute generalization



2003/04
remove name and phone#
generalize major, birth_place, birth_date and gpa
accumulate counts
candidate relation: gender, major, birth_country,
age_range and gpa
Sistemas de Apoio à Decisão
(LEIC Tagus)
Example: Analytical
characterization (3)
gender
major
birth_country
age_range gpa
count
M
F
M
F
M
F
Science
Science
Engineering
Science
Science
Engineering
Canada
Foreign
Foreign
Foreign
Canada
Canada
20-25
25-30
25-30
25-30
20-25
20-25
16
22
18
25
21
18
Very_good
Excellent
Excellent
Excellent
Excellent
Excellent
Candidate relation for Target class: Graduate students (=120)
gender
major
birth_country
age_range gpa
count
M
F
M
F
M
F
Science
Business
Business
Science
Engineering
Engineering
Foreign
Canada
Canada
Canada
Foreign
Canada
<20
<20
<20
20-25
20-25
<20
18
20
22
24
22
24
Very_good
Fair
Fair
Fair
Very_good
Excellent
Candidate relation for Contrasting class: Undergraduate students (=130)
2003/04
Sistemas de Apoio à Decisão
(LEIC Tagus)
Example: Analytical
characterization (4)

3. Relevance analysis

Calculate expected info required to classify an
arbitrary tuple
I(s 1, s 2 )  I( 120,130 )  

120
120 130
130
log 2

log 2
 0.9988
250
250 250
250
Calculate entropy of each attribute: e.g. major
For major=”Science”:
S11=84
For major=”Engineering”:S12=36
For major=”Business”:
S13=0
Number of grad students
in “Science”
2003/04
S21=42
S22=46
S23=42
I(s11,s21)=0.9183
I(s12,s22)=0.9892
I(s13,s23)=0
Number of undergrad students
in “Science”
Sistemas de Apoio à Decisão
(LEIC Tagus)
Example: Analytical
Characterization (5)

Calculate expected info required to classify a
given sample if S is partitioned according to the
attribute
E(major) 
126
82
42
I ( s11, s 21 ) 
I ( s12, s 22 ) 
I ( s13, s 23 )  0.7873
250
250
250
Gain(major)  I(s 1, s 2 )  E(major)  0.2115

Calculate information gain for each attribute

2003/04
Information gain for all attributes
Sistemas de Apoio à Decisão
(LEIC Tagus)
Gain(gender)
= 0.0003
Gain(birth_country)
= 0.0407
Gain(major)
Gain(gpa)
= 0.2115
= 0.4490
Gain(age_range)
= 0.5971
Example: Analytical
characterization (5)
4.
Initial working relation (W0) derivation



R = 0.1
remove irrelevant/weakly relevant attributes from
candidate relation => drop gender, birth_country
remove contrasting class candidate relation
major
Science
Science
Science
Engineering
Engineering
age_range
20-25
25-30
20-25
20-25
25-30
gpa
Very_good
Excellent
Excellent
Excellent
Excellent
count
16
47
21
18
18
Initial target class working relation W0: Graduate students
5.
2003/04
Perform attribute-oriented induction on W0
using Ti
Sistemas de Apoio à Decisão
(LEIC Tagus)
Mining Class Comparisons

Comparison: Comparing two or more classes

Method:

Partition the set of relevant data into the target class and the
contrasting class(es)

Generalize both classes to the same high level concepts

Compare tuples with the same high level descriptions

Present for every tuple its description and two measures



support - distribution within single class

comparison - distribution between classes
Highlight the tuples with strong discriminant features
Relevance Analysis:

2003/04
Find attributes (features) which best distinguish different classes
Sistemas de Apoio à Decisão
(LEIC Tagus)
Quantitative Discriminant
Rules


Cj = target class
qa = a generalized tuple covers some tuples of
target class


but can also cover some tuples of contrasting class
d-weight

range: [0, 1]
d  weight 
count(qa  Cj )
m
 count(q
a
 Ci )
i 1

quantitative discriminant rule form
 X, target_class(X)  condition(X) [d : d_weight]
2003/04
Sistemas de Apoio à Decisão
(LEIC Tagus)
Example (1)

Compare the general properties between the graduate
students and the undergraduate students at the BigUniversity database, given the attributes: name,
gender, etc (in DMQL)
use Big_University_DB
mine comparison as “Grad-vs-Undergrad”
in relevance to name, gender, major, birth_place, birth_date,
residence, phone#, gpa
from “graduate_students”
where status in “graduate”
versus “undergraduate_students”
where status in “undergraduate”
analyze count%
from student
Sistemas de Apoio à Decisão
2003/04
(LEIC Tagus)
Example (2)
Count distribution between graduate and undergraduate students for a generalized tuple

Status
Birth_country Age_range Gpa
Count
Graduate
Canada
25-30
Good 90
Undergraduate Canada
25-30
Good 210
Quantitative discriminant rule
X , graduate _ student ( X ) 
birth _ country( X ) " Canada" age _ range( X ) "25  30" gpa( X ) " good " [d : 30%]

2003/04
where 90/(90+210) = 30%
Sistemas de Apoio à Decisão
(LEIC Tagus)
Class Description

Quantitative characteristic rule
 X, target_class(X)  condition(X) [t : t_weight]


necessary
Quantitative discriminant rule
 X, target_class(X)  condition(X) [d : d_weight]


sufficient
Quantitative description rule
 X, target_class(X) 
condition 1(X) [t : w1, d : w 1]  ...  condition n(X) [t : wn, d : w n]

2003/04
necessary and sufficient
Sistemas de Apoio à Decisão
(LEIC Tagus)
Example: Quantitative
Description Rule
Loc./item
TV
Computer
Both_items
Count t-wt
d-wt
Count t-wt
d-wt
Count t-wt
d-wt
Europe
80
25%
40%
240
75%
30%
320
100%
32%
N_Am
120
17.65% 60%
560
82.35%
70%
680
100%
68%
Both_
regions
200
20%
80%
100% 1000
100%
100%
100% 800
Crosstab showing associated t-weight, d-weight values and total number
(in thousands) of TVs and computers sold at AllElectronics in 1998

Quantitative description rule for target class
Europe
 X, Europe(X) 
(item(X) " TV" ) [t : 25%, d : 40%]  (item(X) " computer" ) [t : 75%, d : 30%]
2003/04
Sistemas de Apoio à Decisão
(LEIC Tagus)
Bibliografia


(Livro) Data Mining: Concepts and
Techniques, J. Han & M. Kamber, Morgan
Kaufmann, 2001 (Capítulo 5 – livro 2001,
Secção 3.7 – draft)
(Livro) Machine Learning, T. Mitchell,
McGraw-Hill, 1997 (Secção 3.4)
2003/04
Sistemas de Apoio à Decisão
(LEIC Tagus)
Information-Theoretic
Approach

Decision tree




each internal node tests an attribute
each branch corresponds to an attribute value
each leaf node assigns a classification
ID3 algorithm



build decision tree based on training objects with
known class labels to classify testing objects
rank attributes with information gain measure
minimal height

2003/04
the least number of tests to classify an object
Sistemas de Apoio à Decisão
(LEIC Tagus)
Top-Down Induction of
Decision Tree
Attributes = {Outlook, Temperature, Humidity, Wind}
PlayTennis = {yes, no}
Outlook
sunny
Humidity
high
no
2003/04
rain
overcast
Wind
yes
normal
yes
strong
no
Sistemas de Apoio à Decisão
(LEIC Tagus)
weak
yes