7. Decision Trees and Decision Rules

Download Report

Transcript 7. Decision Trees and Decision Rules

國立雲林科技大學
National Yunlin University of Science and Technology
Support Vector Machines Classification
with A Very Large-scale Taxonomy
Advisor :Dr. Hsu
Presenter: Chien-Shing Chen
Author: Tie-Yan Liu, Yiming Yang, Hao Wan, Hua-Jun
Zeng, Zheng Chen, and Wei-Ying Ma
SIGKDD , 2004
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Outline









Motivation
Objective
Introduction
Dataset characteristic
Complexity Analysis
Effectiveness Analysis
Experimental Settings
Conclusions
Personal Opinion
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Motivation
very large-scale classification taxonomies
Hundreds of thousands of categories
Deep hierarchies
Skewed category distribution over documents
open question whether the state-of-the-art
technologies in text categorization
evaluation of SVM in web-page classification over
the full taxonomy of the Yahoo! categories
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Objective
1.
2.
3.
4.
scalability and effectiveness
a data analysis on the Yahoo! Taxonomy
development of a scalable system for large-scale
text categorization
theoretical analysis and experimental evaluation of
SVMs in hierarchical and non-hierarchical setting
for classification
threshold tuning algorithms with respect to time
complexity and accuracy of SVMs
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Introduction
TC (Text categorization), SVMs, KNN, NB,…
in recent years, the scale of TC problems to become
larger and larger
Answer this question from the views of scalability
and effectiveness
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
SVM
flat SVMs , hierarchical SVMs
structure of the taxonomy tree
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
SVM
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
SVM
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
SVM
Optimal separating hyperplane between the two
classes by max the margin between the classes’
closest points
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
SVM
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
SVM
Multi-class classification
basically, SVMs can only solve binary classification
problems
fit all binary sub-classifiers
one-against-all
N two-class (true class and false class)
one-against-one
N(N-1)/2 classifiers
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
SVM
be a set of n labeled training
documents
linear discriminant function
a corresponding classification function as
margin of a weight vector
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
SVM
Optimal separation
soft-margin multiclass formulation
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
SVM
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
SVM
Intelligent Database Systems Lab
DATABASE-first characteristic
The full domain of the Yahoo! Directory
292,216 categories
792,601 documents
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
DATABASE-second characteristic
Over 76% of the Yahoo! Categories have fewer than 5 labeled
documents
As “rare categories” increases at deeper hierarchy levels
36% are rare categories at deep levels
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
DATABASE-third characteristic
many documents have multiple labels
average has 2.23 labels
the largest number of labels for a single document is 31
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
N.Y.U.S.T.
I. M.
DATABASE
Yahoo! Directory into a training set and a testing set with a
ratio of 7:3
Remove those categories containing only one labeled
document
132,199 categories
492,617 training documents 275,364 testing documents
Intelligent Database Systems Lab
Complexity and Effectiveness
Flat SVMs, with one-against-rest strategy
N is the number of training documents
M is the number of categories
denotes the average training time per SVM model
model
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Complexity and Effectiveness
Hierarchical
mi is the number of categories defined at the i-th level
j is the size-based rank of the categories
nij is the number of training documents for the j-th category
at the i-th level
ni1 is the number of training document for the most common
category at the i-th level
is a level-specific parameter
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Complexity and Effectiveness
was used to approximate the number of categories at
the i-th level
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Complexity and Effectiveness
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Complexity and Effectiveness
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Complexity and Effectiveness
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Complexity and Effectiveness
For the testing phase of hierarchical SVMs
Pachinko-machine search: 從根部做起,每次從當前類中選
擇一個最可能的子類打開,直到遇到葉子為止
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Complexity of SVM Classification
with Threshold Tuning
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Complexity of SVM Classification
with Threshold Tuning
SCut
Optimal performance of the classifier is obtained for the category
Fix the per-category thresholds when applying the classifier to new
documents in the test set
RCut
Sort categories by score and assign YES to each of the t topranking categories
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
N.Y.U.S.T.
I. M.
Effectiveness Analysis
Compared to scalability analysis, classification
effectiveness is not as clear and predictable
be affected by many other factors
Potential problems of SVM
noisy, imbalanced
Can’t expect the performance of hierarchical SVM to
be very good
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Experimental Results
10 machines, each with four 3GHz CPUs and 4 GB
of memory
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Experimental Results
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Experimental Results
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Experimental Results
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Experimental Results
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Conclusions
Text categorization algorithms to very large problems,
especially large-scale Web taxonomies
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Conclusions
Drawback
Lower performance in deep level
Application
combine SVMs with concept hierarchical tree
Application to Text, or others domain
Pachinko-machine search…
Future Work
learn SVMs kernel to implement ?
Intelligent Database Systems Lab