Mining Multi-label Data by Grigorios Tsoumakas, Ioannis Katakis

MINING MULTI-LABEL DATA
BY GRIGORIOS TSOUMAKAS, IOANNIS KATAKIS, AND IOANNIS VLAHAVAS
Published on July 7, 2010
Team Members: Kristopher Tadlock, Jimmy Sit, Kevin Mack
BACKGROUND AND PROBLEM DEFINITION
• “A large body of research in supervised learning deals with the analysis of
single label data, where training examples are associated with a single label
l from a set of disjoint labels L. However, training examples in several
application domains are often associated with a set of labels Y ⊆ L.
Such data are called multi-label” (Tsoumakas et al.).
• Application: ranking web pages. Web pages often carry several labels. For
example, “cooking”, “food network”, and “iron chef” might all apply to
the same page. How do you rank and classify that page alongside other pages
that share some, but not all, of its labels?
TECHNICAL HIGHLIGHTS OF PROBLEM SOLVING
• Problem Transformation: divide the problem into several single label
problems and solve them using known algorithms.
• Algorithm Adaptation: Change an existing algorithm so you can use it
on a multi-label problem.
• Dimensionality Reduction: Reduce the number of random variables in the
data set or reduce the number of dimensions in the labels. The goal here is to
remove noise so you can focus on the relationships that matter or
concern you.
• Evaluation Measures: How good of a job did you do? How accurately does
your model classify examples?
• Ex. How often are labels misclassified? How often does a less important
label get a higher rank than a more important label?
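The two example questions above correspond roughly to two standard measures: Hamming loss (how often individual labels are misclassified) and ranking loss (how often an irrelevant label is ranked above a relevant one). The following is a minimal sketch in plain Python, not code from the paper; the function names, the four web-page labels, and the toy predictions are all assumptions made for illustration.

```python
# Illustrative sketch only: names and toy data are hypothetical.

def hamming_loss(true_sets, pred_sets, labels):
    """Fraction of (example, label) pairs where prediction and truth disagree."""
    errors = 0
    for y_true, y_pred in zip(true_sets, pred_sets):
        # Symmetric difference = labels wrongly added plus labels wrongly missed.
        errors += len(y_true ^ y_pred)
    return errors / (len(true_sets) * len(labels))


def ranking_loss(true_sets, rankings):
    """Average fraction of (relevant, irrelevant) label pairs ranked in the wrong order.

    Each ranking lists all labels, best first. Examples with no relevant or no
    irrelevant labels have no pairs and contribute 0 to the average.
    """
    total = 0.0
    for y_true, ranking in zip(true_sets, rankings):
        position = {label: i for i, label in enumerate(ranking)}
        relevant = [l for l in ranking if l in y_true]
        irrelevant = [l for l in ranking if l not in y_true]
        if not relevant or not irrelevant:
            continue
        wrong = sum(
            1 for r in relevant for i in irrelevant if position[r] > position[i]
        )
        total += wrong / (len(relevant) * len(irrelevant))
    return total / len(true_sets)


labels = ["cooking", "food network", "iron chef", "travel"]
true_sets = [{"cooking", "iron chef"}, {"travel"}]
pred_sets = [{"cooking"}, {"travel", "cooking"}]
rankings = [
    ["cooking", "travel", "iron chef", "food network"],
    ["travel", "cooking", "iron chef", "food network"],
]

print(hamming_loss(true_sets, pred_sets, labels))  # 0.25
print(ranking_loss(true_sets, rankings))           # 0.125
```

In the toy data, one label is wrong in each of the two examples (2 errors over 8 example-label pairs), and in the first ranking "travel" is placed above the relevant label "iron chef" (1 wrong pair out of 4).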
ILLUSTRATION OF METHODS INTRODUCED
• Problem Transformation: divide the problem into several single
label problems and solve them using known algorithms.
• Label Powerset – Treats each distinct set of labels as if it were a
single label (class), and then ranks the classes by support. Similar to
the first step of Apriori.
• Binary Relevance – Trains one binary (+ or -) classifier per label. If an
instance has that label it is a + example, if not it is a - example.
Similar to the 1R (one rule) method.
• Algorithm Adaptation: Change an existing algorithm so you can
use it on a multi-label problem.
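• Adapted C4.5 Tree – Leaves can hold multiple labels, and splits are
chosen to reduce a multi-label form of entropy:
• $\text{entropy} = -\sum_{j} \left( p(l_j) \log p(l_j) + q(l_j) \log q(l_j) \right)$
where $p(l_j)$ = relative frequency of class $l_j$ and $q(l_j) = 1 - p(l_j)$.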
• ML-kNN: As in regular kNN, find the k nearest neighbors, and then
aggregate their label sets. ML-kNN does this with the maximum a
posteriori principle, which combines what can be known about the
data set without learning (prior probabilities) and after
learning (posterior probabilities).
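The two problem-transformation methods on this slide, Binary Relevance and Label Powerset, can be sketched in a few lines of plain Python. This is a minimal illustration of how the targets are rewritten, not the paper's implementation or Mulan's API; the function names and the toy label sets are assumptions, and no classifier is actually trained here.

```python
# Illustrative sketch only: names and toy data are hypothetical.
from collections import Counter


def binary_relevance(label_sets, labels):
    """One binary target vector per label: True (+) if the example has the label."""
    return {label: [label in y for y in label_sets] for label in labels}


def label_powerset(label_sets):
    """Map each distinct label set to a single class id (a single-label problem)."""
    class_ids = {}
    targets = []
    for y in label_sets:
        key = frozenset(y)  # frozenset so a set of labels can be a dict key
        if key not in class_ids:
            class_ids[key] = len(class_ids)
        targets.append(class_ids[key])
    return targets, class_ids


labels = ["cooking", "iron chef", "travel"]
label_sets = [
    {"cooking", "iron chef"},
    {"cooking"},
    {"cooking", "iron chef"},
    {"travel"},
]

br_targets = binary_relevance(label_sets, labels)
print(br_targets["iron chef"])  # [True, False, True, False]

lp_targets, lp_classes = label_powerset(label_sets)
print(lp_targets)  # [0, 1, 0, 2]

# Support of each Label Powerset class, highest first.
print(Counter(lp_targets).most_common())  # [(0, 2), (1, 1), (2, 1)]
```

Binary Relevance yields one independent binary problem per label, so it ignores relationships between labels. Label Powerset keeps those relationships, but the number of classes can grow with every new combination of labels seen in the data.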
DIMENSIONALITY REDUCTION TECHNIQUES
• Feature Selection: Select a subset of the dimensions for some purpose. Ex. To
minimize a loss function
• Wrapper – A guided search over subsets of the feature set. Subsets are
selected based on some criterion.
• Filter – Score features directly against the data set. Ex. An informed
search based on the result of the LP (Label Powerset) transformation.
• Feature Extraction: Transform the data set into a lower dimensional data set
using some algorithm or reasoning. Uses various statistical and linear algebra
techniques.
• Exploiting Label Structure: Create a general-to-specific tree of labels. An
example cannot be associated with a label (some leaf node) without also being
associated with its parent nodes. Create the relationship by tracing the path
from root to leaf.
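The label-structure rule above (a label implies all of its parent labels) can be sketched as follows. This is a minimal illustration under stated assumptions: the `parent` mapping, the three-level hierarchy, and both function names are hypothetical, and the hierarchy is assumed to be a tree with no cycles.

```python
# Illustrative sketch only: the hierarchy and function names are hypothetical.

# Each label maps to its parent; the root maps to None.
parent = {
    "cooking": None,
    "food network": "cooking",
    "iron chef": "food network",
}


def path_from_root(label, parent):
    """Labels on the path from the root down to the given label."""
    path = []
    node = label
    while node is not None:
        path.append(node)
        node = parent.get(node)
    return path[::-1]


def with_ancestors(labels, parent):
    """Close a label set under the hierarchy: add every ancestor of every label."""
    closed = set()
    for label in labels:
        node = label
        # Stop early once we reach an ancestor that was already added.
        while node is not None and node not in closed:
            closed.add(node)
            node = parent.get(node)
    return closed


print(path_from_root("iron chef", parent))
# ['cooking', 'food network', 'iron chef']
print(sorted(with_ancestors({"iron chef"}, parent)))
# ['cooking', 'food network', 'iron chef']
```

Applying `with_ancestors` to every example's label set makes the data consistent with the tree, so a page tagged only "iron chef" is also treated as an example of "food network" and "cooking".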
SCALING UP PROBLEMS
If you are analyzing a complicated data set with
thousands of labels you can run into problems.
• Number of possible label combinations is significantly
larger than the number of actual examples: far more
combinations of labels and classifiers than actual
labels and classifiers. Output is too complex and not
helpful.
• Problem complexity and performance: It can take too
much memory and/or processing time to classify
everything.
• One technique for reducing complexity is to use a
hierarchy tree, such as HOMER (Hierarchy Of Multi-label
classifiERs). Each layer is a subset of the labeled
data set. Uses predictive algorithms to decide how to
divide up the data set as you descend the tree.
MULTI LABEL DATA MINING SOFTWARE
BoosTexter
Matlab
Mulan
TEAM’S OPINION ON THE METHOD/RESEARCH WORK
• Thorough coverage of the spectrum of techniques and considerations in
multi label data mining.
• Great place to discover new techniques and algorithms.
• Didn’t go into much detail on any one algorithm. Seemed best suited
as a resource or an introduction.
• Didn’t compare or contrast methods. We weren’t sure when to use problem
transformation, algorithm adaptation, or dimensionality reduction.