Mining Class Comparisons
Download
Report
Transcript Mining Class Comparisons
Concept Description and
Data Generalization
(baseado nos slides do livro: Data
Mining: C & T)
Two categories of data
mining
Descriptive mining: describes concepts or taskrelevant data sets in concise, summarative,
informative, discriminative forms
Predictive mining: Based on data and analysis,
constructs models for the database, and predicts
the trend and properties of unknown data
2003/04
Sistemas de Apoio à Decisão
(LEIC Tagus)
What is Concept Description?
Concept description (or class description):
generates descriptions for characterization and
comparison of data
Characterization: provides a concise and succinct
summarization of the given collection of data
Class comparison (or discrimination): provides
descriptions comparing two or more collections of
data
2003/04
Sistemas de Apoio à Decisão
(LEIC Tagus)
Data Generalization
A process which abstracts a large set of taskrelevant data in a database from a low
conceptual levels to higher ones.
1
2
3
4
5
Conceptual levels
Approaches:
• Data cube approach(OLAP approach)
• Attribute-oriented induction approach
2003/04
Sistemas de Apoio à Decisão
(LEIC Tagus)
Concept Description
vs OLAP
Similarities:
Data generalization
Presentation of data summarization at multiple levels of abstraction.
Interactive drilling, pivoting, slicing and dicing.
Differences:
Complex data types of the attributes and their aggregations
Automated process to find relevant attributes and generalization degree
Dimension relevance analysis and ranking when there are many relevant
dimensions.
2003/04
Sistemas de Apoio à Decisão
(LEIC Tagus)
Attribute-Oriented Induction
Proposed in 1989 (KDD ‘89 workshop)
Not confined to categorical data nor particular measures.
How it is done?
Collect the task-relevant data (initial relation) using a relational
database query
Perform data generalization by attribute removal or attribute
generalization, based on the nb. of distinct values of each
attribute.
Apply aggregation by merging identical, generalized tuples and
accumulating their respective counts
Interactive presentation with users
2003/04
Sistemas de Apoio à Decisão
(LEIC Tagus)
Basic Principles (1)
Data focusing: task-relevant data, including dimensions,
and the result is the initial (working) relation.
Attribute-removal: remove attribute A if there is a large
set of distinct values for A but:
(1) there is no generalization operator on A, or
(2) A’s higher level concepts are expressed in terms of other
attributes.
Attribute-generalization: If there is a large set of distinct
values for A, and there exists a set of generalization
operators on A, then select an operator and generalize
Sistemas de Apoio à Decisão
(LEIC Tagus)
2003/04
A.
Basic Principles (2)
Two methods to control a generalization process:
Attribute-threshold control: typical 2-8, specified/default
if the number of distinct values in an attribute is greater than
the att. threshold, then removal or generalization applies
Generalized relation threshold control: sets a threshold
for the generalized (final) relation/rule size
If the number of distinct tuples in the generalized relation is
greater than the threshold, then further generalization
applies
2003/04
Sistemas de Apoio à Decisão
(LEIC Tagus)
Basic Principles (3)
Acummulate count or other aggregate values : to
provide statistical information about the data at diff.
levels of abstraction
Ex: Count value for a tuple in the initial relation is 1,
When generalizing data, n tuples in the initial relation
result in groups of identical tuples merged into a single
generalized tuple (count is n)
2003/04
Sistemas de Apoio à Decisão
(LEIC Tagus)
Basic Algorithm
1.
InitialRel: Query processing of task-relevant data, deriving
the initial relation.
2.
PreGen: Based on the analysis of the number of distinct
values in each attribute, determine generalization plan for
each attribute: removal? or how high to generalize?
3.
PrimeGen: Based on the PreGen plan, perform
generalization to the right level to derive a “prime
generalized relation”, accumulating the counts.
4.
Presentation: User interaction: (1) adjust levels by drilling,
(2) pivoting, (3) mapping into rules, cross tabs,
visualization presentations.
2003/04
Sistemas de Apoio à Decisão
(LEIC Tagus)
Class Characterization:
Example (1)
Describe general characteristics of graduate students
in the Big-University database (in DMQL)
use Big_University_DB
mine characteristics as “Science_Students”
in relevance to name, gender, major, birth_place, birth_date,
residence, phone#, gpa
from student
where status in “graduate”
Corresponding SQL statement:
select name, gender, major, birth_place, birth_date, residence, phone#,
gpa
from student
where status in {“Msc”, “MBA”, “PhD” }
2003/04
Sistemas de Apoio à Decisão
(LEIC Tagus)
Class Characterization: An
Example (2)
Initial Relation
Name
Gender
Jim
Woodman
Scott
Lachance
Laura Lee
…
F
…
Removed
Retained
2003/04
M
M
Major
Birth-Place
Birth_date
Residence
Phone #
GPA
Vancouver,BC, 8-12-76
Canada
CS
Montreal, Que, 28-7-75
Canada
Physics Seattle, WA, USA 25-8-70
…
…
…
3511 Main St.,
Richmond
345 1st Ave.,
Richmond
687-4598
3.67
253-9106
3.70
125 Austin Ave.,
Burnaby
…
420-5232
…
3.83
…
Sci,Eng,
Bus
City
Removed
Excl,
VG,..
CS
Country
Age range
Sistemas de Apoio à Decisão
(LEIC Tagus)
Class Characterization: An
Example (3)
Prime
Generalized
Relation
Gender
M
F
…
Cross-tab
2003/04
Major
Birth_region Age_range Residence
Science
Canada
20-25
Richmond
Science
Foreign
25-30
Burnaby
…
…
…
…
GPA
Count
Very-good
16
Excellent
22
…
…
Birth_Region
Canada
Foreign
Total
Gender
M
16
14
30
F
10
22
32
Total
26
36
62
Sistemas de Apoio à Decisão
(LEIC Tagus)
Presentation of
Generalized Results (1)
Generalized relation:
Relations where some or all attributes are
generalized, with counts or other aggregation
values accumulated.
Cross tabulation:
Mapping results into cross tabulation form
(similar to contingency tables).
Visualization techniques: Pie charts, bar charts,
curves, cubes, and other visual forms.
2003/04
Sistemas de Apoio à Decisão
(LEIC Tagus)
Presentation—
Generalized Relation
2003/04
Sistemas de Apoio à Decisão
(LEIC Tagus)
Presentation—Crosstab
2003/04
Sistemas de Apoio à Decisão
(LEIC Tagus)
Presentation of Generalized
Results (2)
A generalized relation may also be represented in the
form of logic rules
Cj = target class
qa = a generalized tuple describing the target class
t-weight for qa: percentage of tuples of the target class
from the initial working class that are covered by qa
range: [0, 1]
t weight
count(q a )
n
count(q
i 1
2003/04
Sistemas de Apoio à Decisão
(LEIC Tagus)
i
)
Presentation of Generalized
Results (3)
Quantitative characteristic rules: Mapping generalized
result into characteristic rules with quantitative
information associated with it
grad( x) male( x)
birth_ region( x) "Canada"[t :53%] birth_ region( x) " foreign"[t : 47%].
The disjunction of the conditions forms a necessary
condition of the target class, i.e., all tuples of the target
class must satisfy the condition
Not a sufficient condition of the target class, since a
tuple satisfying the same condition could belong to
another class
2003/04
Sistemas de Apoio à Decisão
(LEIC Tagus)
Attribute Relevance
Analysis (1)
Why?
Which dimensions should be included?
How high level of generalization?
Automatic vs. interactive
Reduce # attributes; easy to understand patterns
What?
statistical method for preprocessing data
2003/04
filter out irrelevant or weakly relevant attributes
retain or rank the relevant attributes
relevance related to dimensions and levels
analytical characterization, analytical comparison
Sistemas de Apoio à Decisão
(LEIC Tagus)
Attribute relevance
analysis (2)
How?
1.
2.
3.
Data Collection
Preliminary relevance analysis using conservative
AOI
Analytical Generalization
4.
Attribute-oriented Induction for class description
2003/04
Use information gain analysis (e.g., entropy or other
measures) to identify highly relevant dimensions and
levels.
Sort and select the most relevant dimensions and levels.
Using a less conservative threshold for AOI
Sistemas de Apoio à Decisão
(LEIC Tagus)
Relevance Measures
Quantitative relevance measure:
determines the classifying power of an
attribute within a set of data.
Methods:
information gain (ID3)
gain ratio (C4.5)
gini index
2 contingency table statistics
uncertainty coefficient
2003/04
Sistemas de Apoio à Decisão
(LEIC Tagus)
Entropy and Information
Gain
S contains si tuples of class Ci for i = {1, …, m}
Entropy or expected information measures info
required to classify any arbitrary tuple
m
E ( S ) I( s1,s2,...,sm)
i 1
si
si
log 2
s
s
Entropy of attribute A with values {a1,a2,…,av}
s1 j ... smj
I (s1 j,..., smj)
s
j 1
v
E(A)
Information gained by branching on attribute A
Gain(A) I(s 1, s 2 ,..., sm) E(A)
2003/04
Sistemas de Apoio à Decisão
(LEIC Tagus)
Example of Analytical
Characterization (1)
Task
Mine general characteristics describing graduate
students using analytical characterization
Given
2003/04
attributes name, gender, major, birth_place,
birth_date, phone#, and gpa
Gen(ai) = concept hierarchies on ai
Ui = attribute analytical thresholds for ai
Ti = attribute generalization thresholds for ai
R = attribute relevance threshold
Sistemas de Apoio à Decisão
(LEIC Tagus)
Example of Analytical
Characterization (2)
1.
Data collection
2.
target class: graduate student
contrasting class: undergraduate student
Analytical generalization using Ui
attribute removal
attribute generalization
2003/04
remove name and phone#
generalize major, birth_place, birth_date and gpa
accumulate counts
candidate relation: gender, major, birth_country,
age_range and gpa
Sistemas de Apoio à Decisão
(LEIC Tagus)
Example: Analytical
characterization (3)
gender
major
birth_country
age_range gpa
count
M
F
M
F
M
F
Science
Science
Engineering
Science
Science
Engineering
Canada
Foreign
Foreign
Foreign
Canada
Canada
20-25
25-30
25-30
25-30
20-25
20-25
16
22
18
25
21
18
Very_good
Excellent
Excellent
Excellent
Excellent
Excellent
Candidate relation for Target class: Graduate students (=120)
gender
major
birth_country
age_range gpa
count
M
F
M
F
M
F
Science
Business
Business
Science
Engineering
Engineering
Foreign
Canada
Canada
Canada
Foreign
Canada
<20
<20
<20
20-25
20-25
<20
18
20
22
24
22
24
Very_good
Fair
Fair
Fair
Very_good
Excellent
Candidate relation for Contrasting class: Undergraduate students (=130)
2003/04
Sistemas de Apoio à Decisão
(LEIC Tagus)
Example: Analytical
characterization (4)
3. Relevance analysis
Calculate expected info required to classify an
arbitrary tuple
I(s 1, s 2 ) I( 120,130 )
120
120 130
130
log 2
log 2
0.9988
250
250 250
250
Calculate entropy of each attribute: e.g. major
For major=”Science”:
S11=84
For major=”Engineering”:S12=36
For major=”Business”:
S13=0
Number of grad students
in “Science”
2003/04
S21=42
S22=46
S23=42
I(s11,s21)=0.9183
I(s12,s22)=0.9892
I(s13,s23)=0
Number of undergrad students
in “Science”
Sistemas de Apoio à Decisão
(LEIC Tagus)
Example: Analytical
Characterization (5)
Calculate expected info required to classify a
given sample if S is partitioned according to the
attribute
E(major)
126
82
42
I ( s11, s 21 )
I ( s12, s 22 )
I ( s13, s 23 ) 0.7873
250
250
250
Gain(major) I(s 1, s 2 ) E(major) 0.2115
Calculate information gain for each attribute
2003/04
Information gain for all attributes
Sistemas de Apoio à Decisão
(LEIC Tagus)
Gain(gender)
= 0.0003
Gain(birth_country)
= 0.0407
Gain(major)
Gain(gpa)
= 0.2115
= 0.4490
Gain(age_range)
= 0.5971
Example: Analytical
characterization (5)
4.
Initial working relation (W0) derivation
R = 0.1
remove irrelevant/weakly relevant attributes from
candidate relation => drop gender, birth_country
remove contrasting class candidate relation
major
Science
Science
Science
Engineering
Engineering
age_range
20-25
25-30
20-25
20-25
25-30
gpa
Very_good
Excellent
Excellent
Excellent
Excellent
count
16
47
21
18
18
Initial target class working relation W0: Graduate students
5.
2003/04
Perform attribute-oriented induction on W0
using Ti
Sistemas de Apoio à Decisão
(LEIC Tagus)
Mining Class Comparisons
Comparison: Comparing two or more classes
Method:
Partition the set of relevant data into the target class and the
contrasting class(es)
Generalize both classes to the same high level concepts
Compare tuples with the same high level descriptions
Present for every tuple its description and two measures
support - distribution within single class
comparison - distribution between classes
Highlight the tuples with strong discriminant features
Relevance Analysis:
2003/04
Find attributes (features) which best distinguish different classes
Sistemas de Apoio à Decisão
(LEIC Tagus)
Quantitative Discriminant
Rules
Cj = target class
qa = a generalized tuple covers some tuples of
target class
but can also cover some tuples of contrasting class
d-weight
range: [0, 1]
d weight
count(qa Cj )
m
count(q
a
Ci )
i 1
quantitative discriminant rule form
X, target_class(X) condition(X) [d : d_weight]
2003/04
Sistemas de Apoio à Decisão
(LEIC Tagus)
Example (1)
Compare the general properties between the graduate
students and the undergraduate students at the BigUniversity database, given the attributes: name,
gender, etc (in DMQL)
use Big_University_DB
mine comparison as “Grad-vs-Undergrad”
in relevance to name, gender, major, birth_place, birth_date,
residence, phone#, gpa
from “graduate_students”
where status in “graduate”
versus “undergraduate_students”
where status in “undergraduate”
analyze count%
from student
Sistemas de Apoio à Decisão
2003/04
(LEIC Tagus)
Example (2)
Count distribution between graduate and undergraduate students for a generalized tuple
Status
Birth_country Age_range Gpa
Count
Graduate
Canada
25-30
Good 90
Undergraduate Canada
25-30
Good 210
Quantitative discriminant rule
X , graduate _ student ( X )
birth _ country( X ) " Canada" age _ range( X ) "25 30" gpa( X ) " good " [d : 30%]
2003/04
where 90/(90+210) = 30%
Sistemas de Apoio à Decisão
(LEIC Tagus)
Class Description
Quantitative characteristic rule
X, target_class(X) condition(X) [t : t_weight]
necessary
Quantitative discriminant rule
X, target_class(X) condition(X) [d : d_weight]
sufficient
Quantitative description rule
X, target_class(X)
condition 1(X) [t : w1, d : w 1] ... condition n(X) [t : wn, d : w n]
2003/04
necessary and sufficient
Sistemas de Apoio à Decisão
(LEIC Tagus)
Example: Quantitative
Description Rule
Loc./item
TV
Computer
Both_items
Count t-wt
d-wt
Count t-wt
d-wt
Count t-wt
d-wt
Europe
80
25%
40%
240
75%
30%
320
100%
32%
N_Am
120
17.65% 60%
560
82.35%
70%
680
100%
68%
Both_
regions
200
20%
80%
100% 1000
100%
100%
100% 800
Crosstab showing associated t-weight, d-weight values and total number
(in thousands) of TVs and computers sold at AllElectronics in 1998
Quantitative description rule for target class
Europe
X, Europe(X)
(item(X) " TV" ) [t : 25%, d : 40%] (item(X) " computer" ) [t : 75%, d : 30%]
2003/04
Sistemas de Apoio à Decisão
(LEIC Tagus)
Bibliografia
(Livro) Data Mining: Concepts and
Techniques, J. Han & M. Kamber, Morgan
Kaufmann, 2001 (Capítulo 5 – livro 2001,
Secção 3.7 – draft)
(Livro) Machine Learning, T. Mitchell,
McGraw-Hill, 1997 (Secção 3.4)
2003/04
Sistemas de Apoio à Decisão
(LEIC Tagus)
Information-Theoretic
Approach
Decision tree
each internal node tests an attribute
each branch corresponds to an attribute value
each leaf node assigns a classification
ID3 algorithm
build decision tree based on training objects with
known class labels to classify testing objects
rank attributes with information gain measure
minimal height
2003/04
the least number of tests to classify an object
Sistemas de Apoio à Decisão
(LEIC Tagus)
Top-Down Induction of
Decision Tree
Attributes = {Outlook, Temperature, Humidity, Wind}
PlayTennis = {yes, no}
Outlook
sunny
Humidity
high
no
2003/04
rain
overcast
Wind
yes
normal
yes
strong
no
Sistemas de Apoio à Decisão
(LEIC Tagus)
weak
yes