
Scalable Mining for Classification Rules in Relational Databases
Min Wang
Bala Iyer
Jeffrey Scott Vitter
Presented by: Nadav Grossaug
Abstract
• Problem: training sets grow too large to fit in memory
• MIND (MINing in Databases) classifier
• Can be implemented easily in SQL
• Other classifiers need O(N) space in memory
• MIND scales well with:
  • I/O
  • # of processors
Overview
• Introduction
• Algorithm
• Database Implementation
• Performance
• Experimental Results
• Conclusions
Introduction - Classification Problem
From the DETAIL table we build a CLASSIFIER (a decision tree):

Age <= 30?
  yes → salary <= 62K?
          yes → safe
          no  → risky
  no  → safe
Introduction - Scalability in Classification
Why scalability matters:
• Very large training sets - the data is not memory resident.
• Multiple CPUs - better use of resources.
Introduction - Scalability in Classification
Properties of MIND:
• Scalable in memory
• Scalable in CPU
• Uses SQL
• Easy to implement
Assumptions
• Attribute values are discrete.
• We focus on the growth phase (no pruning).
The Algorithm - Data Structure
Data is kept in the DETAIL table:
DETAIL(attr1, attr2, …, class, leaf_num)
attri = the i-th attribute
class = the class label
leaf_num = the number of the leaf the example
currently belongs to (this can be computed
from the tree grown so far)
The Algorithm - gini index
S - data set
C - number of classes
pi - relative frequency of class i in S
gini index:
gini(S) = 1 - Σi pi²   (sum over the C classes)
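The definition above can be sketched in a few lines of Python (a minimal illustration; the function and variable names are my own):

```python
from collections import Counter

def gini(labels):
    """gini(S) = 1 - sum over classes of p_i**2."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

print(gini(["safe", "safe", "risky", "risky"]))  # 0.5: maximally mixed for 2 classes
print(gini(["safe", "safe", "safe"]))            # 0.0: pure set
```
A pure set scores 0, so lower gini after a split means a better split.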
The Algorithm
GrowTree(DETAIL table)
  initialize tree T and put all records of DETAIL in the root;
  while (some leaf in T is not a STOP node)
    for each attribute i do
      evaluate the gini index for each non-STOP leaf at
      each split value with respect to attribute i;
    for each non-STOP leaf do
      get the overall best split for it;
    partition the records and grow the tree one more
    level according to the best splits;
    mark all small or pure leaves as STOP nodes;
  return T;
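The loop above can be sketched in plain Python. This is a recursive in-memory version for illustration only (MIND itself grows the tree level by level over SQL); all names and the STOP criteria are my own:

```python
from collections import Counter

def gini(labels):
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels, n_attrs):
    """Lowest weighted gini over all binary splits attr_i <= v."""
    best = (None, None, float("inf"))
    for i in range(n_attrs):
        for v in {r[i] for r in rows}:
            left = [c for r, c in zip(rows, labels) if r[i] <= v]
            right = [c for r, c in zip(rows, labels) if r[i] > v]
            if not left or not right:
                continue  # split puts everything on one side
            g = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if g < best[2]:
                best = (i, v, g)
    return best

def grow_tree(rows, labels, n_attrs, min_size=1):
    # STOP condition: pure or small leaf -> return the majority class.
    if len(set(labels)) == 1 or len(rows) <= min_size:
        return Counter(labels).most_common(1)[0][0]
    i, v, _ = best_split(rows, labels, n_attrs)
    if i is None:
        return Counter(labels).most_common(1)[0][0]
    left = [(r, c) for r, c in zip(rows, labels) if r[i] <= v]
    right = [(r, c) for r, c in zip(rows, labels) if r[i] > v]
    return (i, v,
            grow_tree([r for r, _ in left], [c for _, c in left], n_attrs, min_size),
            grow_tree([r for r, _ in right], [c for _, c in right], n_attrs, min_size))
```
For example, `grow_tree([(25,), (28,), (35,), (40,)], ['safe', 'safe', 'risky', 'risky'], 1)` recovers the split "attribute 0 <= 28".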
Database Implementation - Dimension tables
• For each attribute i and each level of the tree:
INSERT INTO DIMi
SELECT leaf_num, class, attri, COUNT(*)
FROM DETAIL
WHERE leaf_num <> STOP
GROUP BY leaf_num, class, attri

Size of DIMi = #leaves × #distinct values of attri × #classes
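The DIMi step can be tried directly with sqlite3. The toy DETAIL table and the use of -1 as the STOP marker are assumptions made for illustration:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE DETAIL (attr1 TEXT, class TEXT, leaf_num INTEGER)")
con.executemany("INSERT INTO DETAIL VALUES (?, ?, ?)", [
    ("young", "safe", 0), ("young", "risky", 0),
    ("old", "safe", 0), ("old", "safe", 1),
])

# One dimension table per attribute: counts per (leaf, class, value).
con.execute("CREATE TABLE DIM1 (leaf_num INTEGER, class TEXT, attr1 TEXT, cnt INTEGER)")
con.execute("""
    INSERT INTO DIM1
    SELECT leaf_num, class, attr1, COUNT(*)
    FROM DETAIL
    WHERE leaf_num <> -1        -- -1 marks STOP leaves (assumption)
    GROUP BY leaf_num, class, attr1
""")
dim = con.execute("SELECT * FROM DIM1 ORDER BY leaf_num, class, attr1").fetchall()
print(dim)
```
DIM1 is tiny compared with DETAIL, which is why all later gini work can run over the dimension tables alone.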
Database Implementation - Dimension table SQL
INSERT INTO DIM1
SELECT leaf_num, class, attr1, COUNT(*)
FROM DETAIL
WHERE leaf_num <> STOP
GROUP BY leaf_num, class, attr1

INSERT INTO DIM2
SELECT leaf_num, class, attr2, COUNT(*)
FROM DETAIL
WHERE leaf_num <> STOP
GROUP BY leaf_num, class, attr2

… (one such statement per attribute)
Database Implementation - UP/DOWN split
For each attribute, compute the per-class counts on each side of every
possible split point:
INSERT INTO UP
SELECT d1.leaf_num, d1.attri, d1.class, SUM(d2.count)
FROM DIMi d1 FULL OUTER JOIN DIMi d2
  ON d1.leaf_num = d2.leaf_num
  AND d2.attri <= d1.attri
  AND d1.class = d2.class
GROUP BY d1.leaf_num, d1.attri, d1.class

(DOWN is built the same way with d2.attri > d1.attri.)
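The running-count idea behind UP can be checked with sqlite3. An inner self-join suffices in this toy case because every class appears at every attribute value; the data and the column name `cnt` are assumptions:

```python
import sqlite3

con = sqlite3.connect(":memory:")
# DIM1 as produced by the GROUP BY step: counts per (leaf, class, value).
con.execute("CREATE TABLE DIM1 (leaf_num INTEGER, class TEXT, attr1 INTEGER, cnt INTEGER)")
con.executemany("INSERT INTO DIM1 VALUES (?, ?, ?, ?)", [
    (0, "safe", 30, 2), (0, "safe", 40, 1),
    (0, "risky", 30, 1), (0, "risky", 40, 3),
])

# UP: for each candidate split value v, the number of records of each
# class with attr1 <= v - a running sum expressed as a self-join.
up = con.execute("""
    SELECT d1.leaf_num, d1.attr1, d1.class, SUM(d2.cnt)
    FROM DIM1 d1 JOIN DIM1 d2
      ON d1.leaf_num = d2.leaf_num
     AND d1.class = d2.class
     AND d2.attr1 <= d1.attr1
    GROUP BY d1.leaf_num, d1.attr1, d1.class
    ORDER BY d1.class, d1.attr1
""").fetchall()
print(up)
```
Each UP row gives the class histogram of the records falling below the candidate split, which is exactly what the gini formula needs.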
Database Implementation - Class Views
Create a view for each class k and attribute i:
CREATE VIEW Ck_UP(leaf_num, attri, count) AS
SELECT leaf_num, attri, count
FROM UP
WHERE class = k
Database Implementation - GINI Value
Create a view holding the gini value of every candidate split:
CREATE VIEW GINI_VALUE(leaf_num, attri, gini) AS
SELECT u1.leaf_num, u1.attri, fgini
FROM C1_UP u1, .., Cc_UP uc, C1_DOWN d1, .., Cc_DOWN dc
WHERE u1.attri = .. = uc.attri = .. = dc.attri
AND u1.leaf_num = .. = uc.leaf_num = .. = dc.leaf_num
(fgini stands for the gini expression written over the 2C count
columns; the chained equalities are slide shorthand)
Database Implementation - MIN GINI Value
Create a table with the minimum gini value for each attribute i:
INSERT INTO MIN_GINI
SELECT leaf_num, i, attri, gini
FROM GINI_VALUE a
WHERE a.gini =
  (SELECT MIN(gini)
   FROM GINI_VALUE b
   WHERE a.leaf_num = b.leaf_num)
Database Implementation - BEST SPLIT
Create a view over MIN_GINI for the best split of each leaf:
CREATE VIEW BEST_SPLIT
(leaf_num, attr_name, attr_value) AS
SELECT leaf_num, attr_name, attr_value
FROM MIN_GINI a
WHERE a.gini =
  (SELECT MIN(gini)
   FROM MIN_GINI b
   WHERE a.leaf_num = b.leaf_num)
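The argmin-per-leaf pattern shared by MIN_GINI and BEST_SPLIT (a correlated subquery keeping the rows that attain each leaf's minimum gini) can be demonstrated with sqlite3 on made-up values:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE MIN_GINI
               (leaf_num INTEGER, attr_name TEXT, attr_value INTEGER, gini REAL)""")
con.executemany("INSERT INTO MIN_GINI VALUES (?, ?, ?, ?)", [
    (0, "age", 30, 0.25), (0, "salary", 62, 0.10),
    (1, "age", 40, 0.30), (1, "salary", 50, 0.45),
])

# Keep, per leaf, only the row(s) attaining the minimal gini value.
best = con.execute("""
    SELECT leaf_num, attr_name, attr_value
    FROM MIN_GINI a
    WHERE a.gini = (SELECT MIN(gini)
                    FROM MIN_GINI b
                    WHERE a.leaf_num = b.leaf_num)
    ORDER BY leaf_num
""").fetchall()
print(best)
```
Leaf 0 picks the salary split and leaf 1 the age split, since those rows carry each leaf's smallest gini.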
Database Implementation - Partitioning
Build the new nodes by splitting the old nodes
according to the BEST_SPLIT values.
Assign each record to its correct node:
• leaf_num is computed by a user function
• no need to UPDATE the data or the DB
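How a user function avoids updating DETAIL can be sketched with sqlite3's `create_function`: leaf membership is computed on the fly while the dimension counts are built. The split (age <= 30) and all data here are made up:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE DETAIL (age INTEGER, class TEXT)")
con.executemany("INSERT INTO DETAIL VALUES (?, ?)",
                [(25, "safe"), (28, "safe"), (40, "risky")])

# Hypothetical user function applying the tree grown so far:
# the root split was age <= 30, so a record falls into leaf 1 or leaf 2.
con.create_function("leaf_num", 1, lambda age: 1 if age <= 30 else 2)

# DETAIL is never updated; leaf membership is derived per record
# as the dimension counts are grouped.
dim = con.execute("""
    SELECT leaf_num(age) AS leaf, class, age, COUNT(*)
    FROM DETAIL
    GROUP BY leaf, class, age
    ORDER BY leaf, age
""").fetchall()
print(dim)
```
Skipping the UPDATE of a huge DETAIL table at every level is a major part of the performance gain the slides attribute to MIND.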
Performance
I/O cost of MIND:
I/O cost of SPRINT:
Experimental Results
• Normalized time to finish building the tree
• Normalized time to build the tree per example
Experimental Results
• Normalized time to build the tree per # of processors
• Time to build the tree by training-set size
Conclusions
• MIND works inside the DB
• MIND works well because
  – MIND rephrases classification as a DB problem
  – MIND avoids UPDATEs of the DETAIL table
  – parallelism and scaling are achieved through the RDBMS
  – MIND uses a user function to gain performance in DIMi creation