Transcript sdm10rdt

Multi-label Classification without Multi-label Cost
- Multi-label Random Decision Tree Classifier
1. IBM Research – China
2. IBM T.J. Watson Research Center
Presenter: Xiatian Zhang
[email protected]
Authors: Xiatian Zhang, Quan Yuan, Shiwan Zhao,
Wei Fan, Wentao Zheng, Zhong Wang
Multi-label Classification
 Classical Classification (Single-Label Classification)
– The classes are exclusive: if an example belongs to one class, it cannot belong to any other class
 Multi-label Classification
– A picture, video, or article may belong to several compatible categories
– A piece of a gene can control several biological functions
[Example image labels: Tree, Winter, Park, Ice, Lake]
Existing Multi-label Classification Methods
 Grigorios Tsoumakas et al. [2007] summarize the existing methods for multi-label classification
 Two Strategies
– Problem Transformation
– Transform the multi-label classification problem into single-label classification problems
– Algorithm Adaptation
– Adapt single-label classifiers to solve the multi-label classification problem
– With high complexity
Problem Transformation Approaches
 Label Powerset (LP)
– Label Powerset considers each unique subset of labels that occurs in the multi-label dataset as a single class label
– [Diagram: each distinct label subset, such as {L1, L2}, {L1, L3}, {L1, L2, L3}, {L3}, {L3, L4}, becomes one class handled by a single classifier]
 Binary Relevance (BR)
– Binary Relevance learns one binary classifier for each label (a small sketch of both transformations follows below)
– [Diagram: per-label binary datasets L1+/L1-, L2+/L2-, L3+/L3- trained by Classifier1, Classifier2, Classifier3]
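To make the two transformations concrete, here is a minimal sketch (not from the slides; the toy dataset and helper names are illustrative) of how a small multi-label dataset is re-encoded under Label Powerset and Binary Relevance:

```python
# Toy multi-label dataset: each example is (feature vector, set of labels).
examples = [
    ([0.1, 0.9], {"L1", "L2"}),
    ([0.4, 0.2], {"L1", "L3"}),
    ([0.8, 0.8], {"L3"}),
]

def label_powerset(data):
    """LP: treat every distinct label subset as one single-label class."""
    return [(x, frozenset(labels)) for x, labels in data]

def binary_relevance(data):
    """BR: build one binary (label present / absent) dataset per label."""
    all_labels = sorted(set().union(*(labels for _, labels in data)))
    return {l: [(x, l in labels) for x, labels in data] for l in all_labels}

print(label_powerset(examples))    # classes: {L1,L2}, {L1,L3}, {L3}
print(binary_relevance(examples))  # three binary problems: L1, L2, L3
```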
Large Number of Labels Problem
 Hundreds of labels, or even more
– Text categorization
– Protein function classification
– Semantic annotation of multimedia
 The impact on multi-label classification methods
– Label Powerset: the number of training examples for each particular label set becomes much smaller
– Binary Relevance: the computational complexity grows linearly with the number of labels
– Algorithm Adaptation: even worse than Binary Relevance
HOMER for the Large Number of Labels Problem
 HOMER (Hierarchy Of Multilabel classifERs) was developed by Grigorios Tsoumakas et al., 2008.
 The HOMER algorithm constructs a hierarchy of multilabel classifiers, each one dealing with a much smaller set of labels.
Our Method – Without Label Cost
 Without Label Cost
– Training time is almost independent of the number of labels |L|
 But with Reliable Quality
– The classification quality is comparable to mainstream methods across different data sets.
 How do we achieve this?
Our Method – Without Label Cost cont.
 Binary Relevance method based on Random Decision Trees
 Random Decision Tree [Fan et al., 2003]
– The training process is independent of label information
– Random construction with very low cost
– Stable quality across many applications
Random Decision Tree – Tree Construction
 At each node, an unused feature is chosen randomly
– A discrete feature is unused if it has never been chosen previously on the decision path from the root to the current node.
– A continuous feature can be chosen multiple times on the same decision path, but each time a different threshold value is chosen.
 Construction stops when one of the following happens:
– A node becomes too small (<= 4 examples).
– The total height of the tree exceeds some limit, such as the total number of features.
 The construction process is independent of label information (see the sketch below)
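A minimal sketch of this construction rule (illustrative only, not the authors' code; for simplicity it treats every feature as continuous and draws a fresh random threshold at each node):

```python
import random

class Node:
    def __init__(self):
        self.feature = None    # index of the feature tested at this node
        self.threshold = None  # randomly chosen threshold for that feature
        self.left = None       # child for x[feature] <  threshold
        self.right = None      # child for x[feature] >= threshold
        self.examples = None   # training examples kept at a leaf (used for statistics later)

def build_random_tree(examples, n_features, depth=0, max_depth=None, min_size=4):
    """Grow one random decision tree; class labels are never consulted."""
    if max_depth is None:
        max_depth = n_features            # height limit, e.g. the number of features
    node = Node()
    if len(examples) <= min_size or depth >= max_depth:
        node.examples = examples          # stop: node too small or tree too deep
        return node
    node.feature = random.randrange(n_features)                 # random feature
    values = [x[node.feature] for x, _ in examples]
    node.threshold = random.uniform(min(values), max(values))   # random split point
    left  = [(x, y) for x, y in examples if x[node.feature] <  node.threshold]
    right = [(x, y) for x, y in examples if x[node.feature] >= node.threshold]
    if not left or not right:             # degenerate split: keep this node as a leaf
        node.examples = examples
        return node
    node.left  = build_random_tree(left,  n_features, depth + 1, max_depth, min_size)
    node.right = build_random_tree(right, n_features, depth + 1, max_depth, min_size)
    return node
```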
Random Decision Tree - Node Statistics
 Classification and Probability Estimation:
– Each node of the tree keeps the number of examples belonging to each class.
 The node statistics process consumes little computational resource (a counting sketch follows below)
[Diagram: example tree with tests F1<0.5, F2>0.7, F3>0.3 and leaf counts such as +:200 / -:10 and +:30 / -:70]
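A counting sketch, building on the hypothetical Node/build_random_tree structure from the previous slide (leaf examples are assumed to be (features, class) pairs):

```python
from collections import Counter

def collect_leaf_statistics(node):
    """At every leaf, record how many training examples of each class reached it."""
    if node.examples is not None:                      # leaf node
        node.class_counts = Counter(y for _, y in node.examples)
        return
    collect_leaf_statistics(node.left)
    collect_leaf_statistics(node.right)

# A leaf that received 30 positive and 70 negative examples would store
# Counter({'-': 70, '+': 30}), matching the +:30 / -:70 leaf on the slide.
```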
Random Decision Tree - Classification
 During classification, each tree routes the instance to a leaf and outputs a posterior probability (see the sketch below):
[Diagram: the same example tree; the leaf reached by x holds +:30 / -:70, so P(+|x) = 30/100 = 0.3]
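Classification in a single tree then amounts to routing the instance to a leaf and reading off the class frequencies. A sketch over the same assumed structures:

```python
def tree_posterior(node, x, target_class="+"):
    """Route x from the root to a leaf and return P(target_class | x)
    estimated from that leaf's class counts."""
    while node.examples is None:                       # descend until a leaf is reached
        node = node.left if x[node.feature] < node.threshold else node.right
    total = sum(node.class_counts.values())
    return node.class_counts.get(target_class, 0) / total
    # A leaf holding +:30 / -:70 gives P(+|x) = 30/100 = 0.3, as on the slide.
```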
Random Decision Tree - Ensemble
 For an instance x, average the estimated probability from each tree and take the average as the predicted probability for x (sketch below).
[Diagram: two example trees; tree 1 estimates P(+|x) = 30/100 = 0.3, tree 2 estimates P'(+|x) = 30/50 = 0.6, so the ensemble predicts (P(+|x) + P'(+|x))/2 = 0.45]
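The ensemble step is then a simple average over trees, matching the worked example (0.3 + 0.6) / 2 = 0.45. A sketch reusing the hypothetical tree_posterior helper above:

```python
def ensemble_posterior(trees, x, target_class="+"):
    """Average the per-tree posterior estimates to get the ensemble prediction."""
    return sum(tree_posterior(tree, x, target_class) for tree in trees) / len(trees)
```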
Multi-label Random Decision Tree
[Diagram: two random decision trees whose leaves now keep counts per label, e.g. L1+:30 / L1-:70 and L2+:50 / L2-:50]
 Tree 1: P(L1+|x) = 30/100 = 0.3, P(L2+|x) = 50/100 = 0.5
 Tree 2: P'(L1+|x) = 30/50 = 0.6, P'(L2+|x) = 20/100 = 0.2
 Ensemble: (P(L1+|x) + P'(L1+|x))/2 = 0.45, (P(L2+|x) + P'(L2+|x))/2 = 0.35
(A per-label counting and prediction sketch follows below.)
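The multi-label variant changes only the leaf statistics: each leaf keeps one positive count per label (plus the leaf total), and prediction averages the per-label posteriors over the trees. A sketch under the same assumptions, where each training example now carries a set of labels instead of a single class:

```python
from collections import defaultdict

def collect_multilabel_statistics(node, all_labels):
    """At each leaf, count how many examples carry each label (all_labels is a set)."""
    if node.examples is not None:                      # leaf node
        node.total = len(node.examples)
        node.label_counts = defaultdict(int)
        for _, labels in node.examples:                # labels is a set, e.g. {"L1", "L2"}
            for l in labels & all_labels:
                node.label_counts[l] += 1
        return
    collect_multilabel_statistics(node.left, all_labels)
    collect_multilabel_statistics(node.right, all_labels)

def multilabel_posterior(trees, x, all_labels):
    """Average P(label present | x) over all trees, for every label at once."""
    probs = {l: 0.0 for l in all_labels}
    for tree in trees:
        node = tree
        while node.examples is None:                   # route x to a leaf
            node = node.left if x[node.feature] < node.threshold else node.right
        for l in all_labels:
            probs[l] += node.label_counts[l] / node.total
    return {l: p / len(trees) for l, p in probs.items()}
    # e.g. P(L1+|x) = (0.3 + 0.6)/2 = 0.45 and P(L2+|x) = (0.5 + 0.2)/2 = 0.35,
    # as in the two-tree example above.
```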
Why RDT Works?
 Ensemble Learning View
– Our Analysis
– Other Explanations
 Non-Parametric Estimation
Complexity of Multi-label Random Decision Tree
 Training Complexity:
– m is the number of trees, and n is the number of instances
– t is the average number of distinct labels at each leaf node, with t << n and t << |L|.
– It is independent of the number of labels |L|.
– Complexity of C4.5: V_i is the number of values of the i-th attribute.
– Complexity of HOMER:
 Test Complexity:
– q is the average depth of the tree branches
– It is also independent of the number of labels |L|
Experiment – Metrics and Datasets
 Quality Metrics:
 Datasets:
Experiment - Quality
Experiment – Computational Cost
Experiment – Computational Cost cont.
Experiment – Computational Cost cont.
Future Work
 Leverage the relationships among labels.
 Apply ML-RDT to Recommendation
 Parallelization and Streaming Implementation