Efficient Rule-Based Attribute-Oriented Induction for Data Mining

Download Report

Transcript Efficient Rule-Based Attribute-Oriented Induction for Data Mining

Efficient Rule-Based
Attribute-Oriented Induction
for Data Mining
Authors: Cheung et al.
Graduate: Yu-Wei Su
Advisor: Dr. Hsu
Outline











Motivation
Objective
Introduction
Basic attribute-oriented induction
Rule-based concept generalization
Rule-based attribute-oriented induction
Path relation
An efficient rule-based AOI
Performance study
Conclusion
Opinion
Motivation
AOI induction capability is limited by the
unconditional concept generalization
 Basic AOI has the problem of time
complexity

Objective
Extending the concept generalization to
rule-based concept hierarchy
 Developing an efficient algorithm to
facilitate induction

Introduction
The growth in size of existing databases
has far exceeded the human abilities to
analyze
 AOI has been implemented in DBMiner
 The engine for concept generalization in
basic AOI is the concept ascension
 To further enhance the capability of AOI,
need to replace concept ascension

Introduction( cont.)
Rule-based Concept Graph can be
generalized to more than one higher level
concept
 Induction anomaly would occur, when a
concept tree is applied directly to a
concept graph
 Instead multi-dimensional data cube to
generalized relation can improve
performance

Basic attribute-oriented induction
The purpose of AOI is to discover rules
from relations
 The primary is concept generalization

Basic attribute-oriented induction( cont.)
By merging the records which have the
same generalized values, characteristics of
the data can be captured
 Primitives in AOI


Task-relevant data


Data that want to be analyzed, called initial relation
Background knowledge

To support concept hierarchies
Basic attribute-oriented induction( cont.)


Representation of learning results
Concept generalization

Notation



Desirable level: an attribute contain a small number
of distinct values
Desirable attribute threshold: to control the number
of distinct values of an attribute
Minimum desirable level: when generalized to a level
lower than the current one, the distinct number would
more than desirable attribute threshold
Basic attribute-oriented induction( cont.)

1.
2.
Prime relation: every attribute is at minimum
desirable level R’
Inducing the prime relation R’ from initial
relation
Progressive generalization
Basic attribute-oriented induction( cont.)
Rule-based concept generalization
Rule-based concept generalization is a
more general scheme and a concept can
ascend via more than one path
 Generalization rule can be determine
whether a concept can be generalized
along one path

Rule-based concept generalization( cont.)

Deductive rule generalization


The rule associated with a generalization path
Concept hierarchy associated with deduction
generalization rules is called deduction-rulebased concept graph
Rule-based attribute-oriented induction

General model definition (DB, CH, DS, KR,
t a)






DB: underlying database
CH: a set of rule-based concept hierarchies
DS: deduction system supporting CH
KR: representation scheme of learned result
Ta: desirable attribute threshold
The generalization and rule creation
process is fundamentally the same as
basic AOI
Rule-based attribute-oriented induction( cont.)

Concept ascension require additional
information and it may not available in the
prime relation, called Induction anomaly



An depending attribute has been removed
An depending attribute can be generalized too
high to match the condition
An depending condition can only be evaluated
against the initial relation
Rule-based attribute-oriented induction( cont.)
Path relation
Further generalization in the rule-based
case has to be started again from the
initial relation and it is costly and wasteful
 Using a path relation to capture the
generalization result from one of the rules
on the initial relation
 A tuple in initial relation can only be
generalized via a unique path to the root

Path relation( cont.)
Before induction starts, a preprocessing is
used to identify and label the
generalization paths of all the attributes
 Every tuple can be transformed into a
tuple of ids of the associated
generalization paths
 Path relation has captured completely the
generalization result of the initial relation

Path relation( cont.)
Path relation( cont.)

Another issue in rule-based
generalization is the cyclic dependency

Generalization is depended on another
attribute, if dependency is cyclic, the
deadlock will happened
Path relation( cont.)
Path relation( cont.)

Data structure for generalization
An efficient rule-based AOI
An efficient rule-based AOI( cont.)
Performance study
The time complexity of relation path
algorithm is O(n) and backtracking
algorithm is O(n log n)
 The difference between two algorithms



Backtracking uses generalized relation as data
structure
Further generalization after prime relation has
been generated is based in the prime relation
Performance study( cont.)
Experiment data is a synthesized student
data
 Attribute name: name, status, sex, age,
GPA
 Conditions





Graduate students are at least 22 years
Ph.D. student are at least 25 years
Graduate student’s GPA are at least 3
Generalization order
Performance study( cont.)
Performance study( cont.)
Conclusion
Concept hierarchies may contain
unconditional and conditional rules, which
enlarges the application domain
 Path relation algorithm has an improved
complexity of O(n)
 Rule-based induction can extend the
capability of DBMiner

Opinion
The numerical attribute problem still exist
 Experiments are to few and can’t see the
improved results
